Synopsis



Introduction 

Proteins are an important class of biomolecules that serve as essential
building blocks of cells. They are structurally complex and functionally
among the most sophisticated molecules known. They perform diverse
biochemical functions and also provide the structural basis of living cells.
These all-pervasive, versatile molecules constitute (barring water) the
largest fraction of the total mass of the cell. Proteins are macromolecules
comprising thousands of atoms. They are characterised by a specific
structure which specifies their function.

In the cell, they are synthesised in a complex multi-step process that
proceeds from DNA to RNA to protein, thereby giving a genetic basis to the
protein sequence. Chemically, proteins are linear chains composed of 20 types
of monomeric molecules called 'amino acids'. These amino acids are linked
together by a backbone made of peptide bonds. This polypeptide chain folds
into its unique three-dimensional (3-D) structure, known as the 'native
state'. How a protein, starting from a linear chain of molecules, attains its
specific 3-D structure is an unsolved problem in computational biology,
known as the 'protein folding problem'. A protein is a system with an
inherent 1-D structure in the form of the polypeptide backbone, held together
by covalent peptide bonds. The polypeptide chain folds onto itself by virtue
of the chemical forces acting among the constituent residues, thereby
creating noncovalent 'contacts' on various length scales, as specified by the
separation between the contacting residues along the chain. These length
scales can be loosely defined as short-range and long-range.

Proteins perform an array of functions in the cell. They perform these specific
functions by virtue of their precise structure and chemistry. Structures are a 
critical determinant of their functions. Hence the study of structure-function 
relationship, prediction of structure given the sequence etc., are important 
areas of research. 

Various approaches have been taken for this purpose. Experimentalists have 
been performing biophysical experiments supported with genetics to get an- 
swers to questions pertaining to protein structure-function relationship. The- 
oreticians have believed that with the help of computational power they will 
be able to obtain the answers that have been eluding the experimentalists. 
Within theory, two distinct approaches have been used: Forward and Reverse
Engineering. Forward Engineering is the traditional way, in which one works
from sequence to structure in the hope of obtaining some general results on
the protein folding problem and other related questions. Reverse Engineering
relies on the large pool of structural data that has been made available
(such as that in the Protein Data Bank (PDB)). It approaches the problem in
reverse, as the name suggests, and tries to uncover the laws by which the
structures were put in place.

A complex system can be modelled from various perspectives. Complex network
analysis is one way to study a system such as a protein structure.

Objectives 

In this thesis we use coarse-grained reverse engineering as our tool and
investigate experimentally known protein structures in an attempt to gain a
better understanding of the processes by which they were constructed.
Specifically, we focus on the following four points.

• Describe protein structures as 'contact networks' at two different length
scales: Protein Contact Networks (PCN) and Long-range Interaction
Networks (LIN).

• Study the general complex network properties of protein structures. 

• Investigate how different secondary and tertiary structural features of
proteins are reflected in their network properties.

• Investigate the relation of network properties with biophysical properties
of proteins, such as the rate of folding.

Work Plan 

The work plan was as follows: 

• To develop programs for extracting relevant data from the PDB file, to 
develop the code to build and visualise the complex network model of 
protein structures from the extracted data. 

• To develop algorithms and programs for complex network analyses of 
PCNs and LINs. 

• To develop appropriate controls of the networks. 

• To calculate and study the relationship of the general network param- 
eters for PCNs and LINs. 

• To analyse the relationship between the network parameters of PCNs, LINs,
and their controls to identify topological properties of PCNs and their
relevance in the structure-function relationship of proteins of diverse
classes.

• To correlate two general network properties (assortativity and clustering)
with the rate of folding of single-domain, two-state folding proteins.

Results 

In Chapter [2] we lead the reader into the content of the dissertation by
providing the methods and materials. The data on protein structures enable
us to model them at atomic-level resolution, but we opt to coarse-grain the
proteins on two different scales. First, we model protein structures as
Protein Contact Networks (PCNs), in which the atomic-level details are
jettisoned and each amino acid is represented as a point situated at the
coordinates of its Cα atom. Noncovalent interactions, responsible for the
folding and stability of proteins, are depicted as spatial contacts: any two
residues are said to be in contact if they are at a distance of less than or
equal to 8 Å. On a coarser level, we model Long-range Interaction Networks
(LINs) wherein, apart from the backbone, we consider only those contacts in
the PCN that exist between residues that are distant from each other along
the backbone. We present computational procedures for creating PCNs and LINs
as well as their random controls. We also present various ways of
visualising PCN data while highlighting its various features. We define
various network parameters and illustrate them. Finally, we present the data
used for analyses in the rest of the dissertation.
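
The construction of the two networks can be sketched in a few lines of
Python. This is a minimal illustration and not the code used in the
dissertation: the function names `contact_network` and `long_range_network`
are hypothetical, the default cutoff of 8 Å is an assumption and is left as a
parameter, and the sequence-separation threshold `min_separation=12` is a
placeholder, since the text only says that long-range contacts link residues
that are distant from each other along the backbone.

```python
import math


def contact_network(ca_coords, cutoff=8.0):
    """Return the PCN as a set of undirected edges (i, j) with i < j.

    ca_coords: list of (x, y, z) C-alpha coordinates in angstroms, one per
    residue, in chain order. cutoff is an assumed, adjustable parameter.
    """
    edges = set()
    n = len(ca_coords)
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(ca_coords[i], ca_coords[j]) <= cutoff:
                edges.add((i, j))
    return edges


def long_range_network(pcn_edges, n_residues, min_separation=12):
    """Return the LIN: backbone links plus long-range PCN contacts.

    min_separation is a hypothetical threshold on |i - j| along the chain.
    """
    backbone = {(i, i + 1) for i in range(n_residues - 1)}
    long_range = {(i, j) for (i, j) in pcn_edges if j - i >= min_separation}
    return backbone | long_range
```

The coordinates would typically come from the Cα `ATOM` records of a PDB
file; the parsing step is omitted here. Keeping both thresholds as arguments
makes it easy to check how sensitive the resulting networks are to them.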

In Chapter [3] we investigate the "small-world" nature of PCNs from proteins
of various structural and functional classes. All PCNs, irrespective of
their classes, showed high clustering (C) and low characteristic path length
(L) when compared with their random and regular control networks. We also
show that L increases with the logarithm of the size of the PCNs. The
small-world nature of proteins was shown earlier for globular proteins
alone. The question that follows is whether non-globular proteins, such as
fibrous proteins, also have a small-world nature. We investigate this
question in this chapter and emphasise that the "small-world" result is
general and not restricted to globular proteins. Besides the L-C properties,
we also investigate the degree distributions of PCNs and LINs, the
hierarchical nature of the PCNs, and other relevant network features.
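
As an illustration of the two quantities, the sketch below computes C and L
for a small undirected network. It assumes the standard Watts-Strogatz
definitions (C as the mean local clustering of the nodes, L as the mean
shortest-path length over connected node pairs); the precise definitions
used in the dissertation are given in Chapter [2] and may differ in detail.
The function names are hypothetical.

```python
from collections import deque


def adjacency(n, edges):
    """Build an adjacency map {node: set of neighbours} for n nodes."""
    adj = {v: set() for v in range(n)}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    return adj


def clustering_coefficient(adj):
    """Mean local clustering; nodes of degree < 2 contribute zero."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        # Count links that exist among the neighbours of this node.
        links = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def characteristic_path_length(adj):
    """Mean shortest-path length over connected ordered pairs (BFS)."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0
```

A network is usually called small-world when its C is much larger than that
of a random control of the same size and density while its L stays close to
that of the random control.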

Amongst all the complex network systems studied, protein structures are
special because of their biological importance. Hence some unique property
is anticipated in the network properties of proteins. We do indeed find such
a property. In Chapter [4] we discuss this property, assortative mixing, in
the contact networks of proteins at both short and longer length scales (PCN
and LIN). We show that proteins are assortative in nature, i.e. rich
(high-degree) nodes tend to make contact with other rich nodes and poor
(low-degree) nodes tend to make contact with each other. Assortative degree
correlation is an exceptional property in the field of complex networks, as
other networks (except for social networks) are known to be disassortative.
Since assortative mixing is known to play a role in information transfer
across a network, this implies that proteins are structurally (and hence
functionally) different from other networks. We further explore the
topological origins of assortativity by constructing appropriate controls.
Random controls in which the degree distribution of the nodes is conserved
regain the assortative mixing that is otherwise absent in the null model.
This indicates that the degree distribution is a crucial feature that
specifies assortativity in proteins. We also discuss other possible
properties that might be conferred on proteins by virtue of their
assortative mixing.
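
A compact way to see what is being measured is the sketch below. It assumes
that the coefficient of assortativity is the Pearson correlation between the
degrees found at the two ends of each edge (Newman's definition), and it
builds a degree-preserving random control by repeated edge swaps. Both
functions are illustrative, with hypothetical names; the controls actually
used in the dissertation may impose further constraints, such as keeping the
backbone intact.

```python
import random


def degrees(edges):
    """Return {node: degree} for an undirected edge collection."""
    deg = {}
    for i, j in edges:
        deg[i] = deg.get(i, 0) + 1
        deg[j] = deg.get(j, 0) + 1
    return deg


def assortativity(edges):
    """Pearson correlation of the degrees at either end of an edge.

    Raises ZeroDivisionError when every node has the same degree.
    """
    deg = degrees(edges)
    xs, ys = [], []
    for i, j in edges:
        # Each undirected edge is counted in both directions.
        xs += [deg[i], deg[j]]
        ys += [deg[j], deg[i]]
    m = len(xs)
    mean = sum(xs) / m
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / m
    var = sum((x - mean) ** 2 for x in xs) / m
    return cov / var


def degree_preserving_rewire(edges, n_swaps, seed=0):
    """Randomise a simple graph while keeping every node's degree fixed."""
    rng = random.Random(seed)
    edge_list = [tuple(sorted(e)) for e in edges]
    edge_set = set(edge_list)
    for _ in range(n_swaps):
        i1, i2 = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i1]
        c, d = edge_list[i2]
        if rng.random() < 0.5:
            c, d = d, c
        new1, new2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
        # Skip swaps that would create a self-loop or a duplicate edge.
        if a == d or c == b or new1 in edge_set or new2 in edge_set:
            continue
        edge_set -= {edge_list[i1], edge_list[i2]}
        edge_set |= {new1, new2}
        edge_list[i1], edge_list[i2] = new1, new2
    return edge_set
```

Comparing `assortativity` on a network and on its rewired control indicates
how much of the degree correlation is accounted for by the degree
distribution alone.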

The fact that proteins have a special network property leaves us with more
questions. Natural selection is not necessarily a causal factor for
assortative mixing in complex systems. There are network systems of
biological origin (e.g. the yeast gene regulatory network and protein
interaction network) that have been subjected to natural selection but are
known to be disassortative. In Chapter [5] we ask: Do biophysical properties
have any bearing on network properties of proteins?

For this we chose 30 single-domain two-state folding proteins whose rates of
folding are available. We notice that, as opposed to the clustering
coefficients of PCNs ($C_{PCN}$), whose values are bunched together and
nearly indistinguishable, those of LINs ($C_{LIN}$) are spread out and
distinct. We show that the clustering coefficients of LINs, $C_{LIN}$, are
negatively correlated with the rate of folding ($\ln(k_F)$). Each protein's
departure from mean compaction in its LIN is associated with its rate of
folding: the greater the departure, the faster the folding. Also, we find
that the coefficient of assortativity of LINs ($r_{LIN}$) is positively
correlated with $\ln(k_F)$. Thus we identify two general network properties
(clustering coefficient and coefficient of assortativity) that have negative
and positive associations, respectively, with the rate of folding of
proteins.

Conclusions 

In this thesis we have investigated protein structures using a network
theoretical approach. In doing so we used a coarse-grained method, viz.,
complex network analysis. We found that proteins, by virtue of being
characterised by a high amount of clustering, are small-world networks. We
also found that, regardless of structural classification, all proteins, even
fibrous proteins, have the signature of small-world nature. Apart from the
small-world nature, we found that proteins have another general property,
viz., assortativity. This is an interesting and exceptional finding, as all
other complex networks (except for social networks) are known to be
disassortative. Importantly, we could identify one of the major topological
determinants of assortativity by building appropriate controls. In our
controls the assortativity is partially recovered. Small-world nature and
assortativity together could be useful in dissipating mechanical
disturbances across sparsely distributed amino acids.

The interesting question is whether these general network parameters can
offer any meaningful insight into the specific system properties (in this
case, the biophysical properties) of proteins, which are naturally evolved
intra-cellular networks. In this thesis we have shown that such correlations
can be observed even in a coarse-grained model of protein structures at
different length scales. Our results indicate that the clustering
coefficient ($C_{LIN}$) of the LINs of the single-domain two-state folding
proteins is negatively correlated, and the coefficient of assortativity
($r_{LIN}$) is positively correlated, with the rate of folding of these
proteins. At the PCN level, $C_{PCN}$ shows no significant correlation, but
$r_{PCN}$ has a low but significant association with the rate of folding.
This indicates that our reverse engineering approach can offer significant
understanding of the differential role of contact formation ("folding") at
different length scales in proteins. We discuss our results in the light of
some open questions on modularity in protein structure, the folding process,
and the evolutionary conservation of residues.


Introduction



1.1 Protein: An Important Biomolecule 

Proteins are an important class of biomolecules and serve as essential building 
blocks of the cell. They are structurally the most complex and functionally 
one of the most sophisticated molecules known. They perform diverse bio- 
chemical functions and also provide structural basis in living cells. Barring 
water, these all-pervasive, versatile molecules constitute the largest fraction 
of the total mass of the cell. 

Chemically, proteins are linear chains composed of monomeric molecules
called 'amino acids'. These amino acids are linked together by a backbone
made of peptide bonds. This polypeptide chain folds into its unique
three-dimensional structure, known as the 'native state'. Proteins are
involved in practically every function performed by a cell, such as gene
regulation, signal transduction, and metabolism. These functional abilities
(listed below) of proteins are specified by their detailed
three-dimensional (3-D) structures.

Following is a partial list of the roles that proteins, by virtue of their
structure, are known to play.

• Enzymes (e.g. biological catalysts)

• Antibodies (e.g. immune system molecules)

• Regulation (e.g. transcription, translation)

• Messengers (e.g. transmission of nervous impulses, hormones)

• Transport (e.g. transportation of molecules ranging from electrons to
macromolecules)

• Storage (e.g. hemoglobin stores oxygen, ferritin stores iron)

• Mechanical Support (e.g. structural proteins such as collagen, used in
skeletons)



Studying protein structures is not only of fundamental scientific interest
in terms of understanding biochemical processes, but also produces practical
benefits. Understanding proteins' structural properties, their relation to
function (as well as loss of function), folding kinetics, and the relevance
of specific residues (sometimes referred to as 'hot' residues) and of
collections of residues (such as the folding nucleus and active sites) is of
considerable importance in the biotechnology industry, agriculture, and
medicine, to name a few. The knowledge gained from such an understanding
can be put to use for 'protein engineering'. With a better understanding of
the areas mentioned above, the properties of proteins could be modified and
enhanced, and proteins with novel and desired properties could even be
designed de novo.

It is important to understand how proteins consistently fold into their
native-state structures and the relevance of structure to their functions.
The folding mechanism, kinetics, structure, and function of proteins are
intimately related to each other. Misfolding of proteins into non-native
structures can lead to several disorders [1]. An understanding of the
folding process will provide clues to misfolding and the resulting
disorders. Correlating sequence with structure, as well as understanding
folding kinetics, has been an area of intense activity for experimentalists
and theoreticians. Among the different theoretical approaches used for
studying protein structure, function, and folding kinetics, a graph
theoretical approach based on perspectives from complex networks has been
used recently to study protein structures.


1.2 Protein: A Complex System

Proteins can be regarded as complex dynamical systems, which is reflected in
the surprisingly fast folding process by which they attain their
native-state structure. Despite their large number of degrees of freedom,
proteins fold into their native state in a very short time, an observation
known as Levinthal's Paradox [12]. All the information needed to specify a
protein's three-dimensional structure is contained within its amino acid
sequence. Given suitable conditions, most small proteins fold to their
native states [13]. The spontaneous folding of proteins into their elaborate
three-dimensional structures, starting from linear chains, is one of the
remarkable examples of biological self-organisation.

Not only individual proteins but also multi-protein units, working together,
are proposed to act as 'computational elements in living cells' [14]. Many
proteins that appear to have as their primary function the transfer and
processing of information are functionally linked through allosteric or
other mechanisms into biochemical 'circuits' that perform a variety of
simple computational tasks including amplification, integration, and
information storage [14].



1.3 Complex Network Models: A Brief Historical Perspective



Complex systems that are characterised by discrete constituents and their
inter-relationships have traditionally been studied in the field of graph
theory [15]. Erdős and Rényi proposed that connectivity in large-scale,
real-world networks is random. For decades this proposition remained
unchallenged [15]. Systems of high complexity and of diverse origins are
known to be driven by networks of elements. Living cells are, ultimately,
the outcome of dynamic interactions among various networks such as
protein-protein interaction networks, gene regulatory networks, signal
transduction pathways, and metabolite networks. Many non-biological systems
are also amenable to complex network analysis. A few examples of such
networks are: the Internet, the World-Wide Web, Software Networks, the
Power-Grid Network, Transportation Networks (Railways [23], Airlines
[24, 25]), and Social Networks [26, 27, 28]. Each of these networks is
shaped by physical, spatial (geographical), technological, and even
political influences that are specific to that system. The question then is:
can the networks driving such systems be inherently random? It seems logical
that the processes and dynamics responsible for these networks wire them in
a non-random fashion. Lately, it has been shown that complex networks are
indeed non-random [16].



In recent years, there has been considerable interest [16, 29] in the
structure and dynamics of networks, with application to systems of diverse
origins such as society (actors' networks, collaboration networks, etc.),
technology (world-wide web, Internet, transportation infrastructure), and
biology (metabolic networks, gene regulatory networks, protein-protein
interaction networks, food webs). These networks are characterised by some
universal properties, such as small-world nature [30, 31] and a scale-free
degree distribution [19, 20].



Below we briefly summarise a few of the important network features. 



Small-World Networks

Erdős and Rényi [15] define a random graph as $N$ nodes connected by $n$
edges which are chosen randomly from the $N(N-1)/2$ possible edges. In
reality, however, the connections in networks are not random and are
dictated by various forces. One way this is reflected is that real-world
networks have unusually high clustering coefficients. Networks with such a
high amount of clustering are classified as "small-world networks". Watts
and Strogatz [31] visualised them as depicted in Figure 1.1, where β is the
edge rewiring probability. Small-world graphs are the systems obtained
midway between regular (β = 0) and random (β = 1) graphs when, starting from
regular networks, the edges are rewired with increasing probability β.
Networks of diverse origins have been shown to have a small-world nature
[16, 29].
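
The rewiring procedure can be sketched as follows. This is a simplified
illustration of the Watts-Strogatz construction, assuming a ring lattice in
which every node is linked to its k nearest neighbours (k even, n > k); the
function names and the `seed` argument are hypothetical conveniences and are
not taken from the original paper.

```python
import random


def ring_lattice(n, k):
    """Regular ring: each node is linked to its k nearest neighbours."""
    return {tuple(sorted((i, (i + d) % n)))
            for i in range(n) for d in range(1, k // 2 + 1)}


def watts_strogatz(n, k, beta, seed=0):
    """Rewire each lattice edge with probability beta.

    beta = 0 returns the regular lattice; beta = 1 rewires every edge.
    The number of edges is preserved for any beta.
    """
    rng = random.Random(seed)
    edges = ring_lattice(n, k)
    for i in range(n):
        for d in range(1, k // 2 + 1):
            old = tuple(sorted((i, (i + d) % n)))
            if old not in edges or rng.random() >= beta:
                continue
            # New endpoints must avoid self-loops and duplicate edges.
            candidates = [t for t in range(n)
                          if t != i and tuple(sorted((i, t))) not in edges]
            if not candidates:
                continue
            edges.remove(old)
            edges.add(tuple(sorted((i, rng.choice(candidates)))))
    return edges
```

Sweeping β from 0 to 1 and measuring the clustering coefficient and the
characteristic path length of each resulting graph is the usual way to
locate the small-world regime between the two extremes.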



Scale-Free Networks 



While modelling systems as random graphs and with the small-world model, the
emphasis was on modelling the network topology. The scale-free model put the
emphasis on modelling network assembly and evolution. While the goal of the
former models was to construct a graph with correct topological features,
modelling scale-free networks put the emphasis on capturing the network
dynamics [16].

Figure 1.1: The Small World Topology.

The Scale-Free model is composed of two constituents:



1. Growth: Starting with a small number ($m_0$) of nodes, at every time-step
one adds a new node with $m\ (\le m_0)$ edges that link the new node to $m$
different nodes already present in the system.

2. Preferential Attachment: When choosing the nodes to which the new node
connects, one assumes that the probability $\Pi$ that a new node will be
connected to node $i$ depends on the degree $k_i$ of node $i$, such that

$$\Pi(k_i) = \frac{k_i}{\sum_j k_j}.$$

Scale-free networks are characterised by a power-law degree distribution.
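
The two constituents can be combined in a short simulation. The sketch below
is illustrative only: the function name `barabasi_albert` is hypothetical,
and the seed graph is assumed to be a path joining the initial nodes so that
every initial node has a non-zero degree (the model description above does
not fix the initial edges).

```python
import random


def barabasi_albert(n_total, m0, m, seed=0):
    """Grow a network of n_total nodes by preferential attachment.

    Starts from m0 nodes joined in a path (an assumption) and adds one
    node per step, linking it to m distinct existing nodes chosen with
    probability proportional to their degree.
    """
    if not (m0 >= 2 and 1 <= m <= m0):
        raise ValueError("require m0 >= 2 and 1 <= m <= m0")
    rng = random.Random(seed)
    edges = [(i, i + 1) for i in range(m0 - 1)]
    # Each node appears in stubs once per unit of degree, so a uniform
    # draw from stubs is a draw proportional to degree.
    stubs = [v for e in edges for v in e]
    for new in range(m0, n_total):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        for t in targets:
            edges.append((t, new))
            stubs += [t, new]
    return edges
```

Drawing from the `stubs` list avoids recomputing the normalising sum of all
degrees at every step, which keeps the simulation linear in the number of
edges added.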

Modularity

The concept of modularity assumes that the system's functionality can be
seamlessly partitioned into a collection of modules [32]. Various networks
that have been investigated so far have been found to be modular in nature,
where a module is a discrete entity with several elementary components that
performs an identifiable task, separable from the functions of the other
modules.

Degree Correlations

A measure that expresses the degree-degree correlation of a network is
assortativity. It indicates whether, in a network, nodes with poor (low)
degrees tend to connect to other nodes with poor degrees or to those with
rich (high) degrees. A network can be assortative, dominated by rich-to-rich
node connections, or it can be disassortative, with more rich-to-poor
connections. A random network has no preferred degree correlation
tendencies. It is known that most real-world networks (except for social
networks) are disassortative [28], and the origin of disassortativity in
real-world networks is listed as "one of the ten most leading questions for
network research" [34]. This property has been shown to have a bearing on
the percolation threshold of the network [28].



Some of the leading questions in complex network research 

Apart from the origin of disassortativity, as mentioned above, many other
questions are unsolved and are considered to play a potentially important
role in the field of network research [34]. Many networks in nature are found to
be modular as well as hierarchical. The emergence of modularity and how 
it could be reconciled with other properties of networks are basic questions 
in network research. Networks are characterised by topology as well as the 
dynamics that is taking place over it. Are there universal features to the 
network dynamics similar to the topology? Compared to the technological 
networks, the evolution of biological networks is much more complex. What 
could be the evolutionary mechanisms that shape the topology of the biolog- 
ical networks? These and many other questions remain at the forefront of 
the network research. 



Biological Networks 

Among the various networks studied, biological networks are of special in- 
terest as they are the product of long evolutionary history. The mode of cre- 
ation, evolution and functionality of these networks are distinct from those 
of technological networks. 

• Biological networks are the products of natural selection, as opposed to
the rest of the networks.

• The time-scale over which these have evolved is orders of magnitude
larger than that of non-biological networks.

• Since each of these evolutionary machines is the outcome of the 'survival
of the stable' [35, page 12] rule, these systems are the most stable and
robust (against all natural detrimental sources) systems known to us.



These are of academic interest for their complex, versatile, dynamic, and
evolvable nature. On the practical side, understanding the nature of
biological systems has direct or indirect implications for drug design,
disease diagnosis and cure, epidemic control, and biotechnological
applications.

Biological networks could be characterised by the length-scale as follows.

• Ecological (Food Webs) Networks [39, 40, 41] (in metres)

• Inter-cellular Networks (between micrometres to millimetres)

• Protein-Protein Interaction Networks [42, 43, 44], Metabolic Pathways
[45, 46, 47], Gene Regulation Networks [49] (in micrometres)

• Macromolecular Networks: Complex Chemicals, Polymers [50] (in nanometres)



Our focus is on 'protein structures', an interesting class of macromolecular 
networks. Proteins are unique among all other networks. They are con- 
stituted from a linear polymer chain of amino acids as opposed to sparsely 
distributed unconnected nodes as in most other networks. They evolve by 
changing their conformation and not by addition or removal of nodes. Their 
polypeptide backbone attains a stable shape through well-defined secondary 
structures and tertiary folds. 

It is important to understand how proteins consistently fold into their
native-state structures and the relevance of structure to their biological
function. Network analysis of protein structures is an attempt to study them
as complex dynamical systems composed of a web of interacting elements and,
thereby, to understand the possible relevance of various network parameters.

^We classify 'social networks' as non-biological, in keeping with our
criterion that a biological network is a system shaped by natural selection.


1.4 Protein Structures as Complex Networks



Many efforts have been made to model biological systems from a complex
systems viewpoint [51, 53, 54, 55]. Specifically, they have been
increasingly studied as complex networks [57, 58, 59, 60]. Amongst all the
biological systems, a system of special interest is that of a 'protein', for
its structure, function, kinetics and stability. With its omnipresence in
the cell and diverse functionality, it is a biomolecular system with immense
implications for cellular dynamics. Its function is specified by its
structure. The structure is also associated with its kinetic properties and
stability. All these make a protein a very interesting system to study as a
'complex dynamical system'. Here, we are specifically interested in studying
protein structures as networks of noncovalent contacts together with the
covalent backbone contacts.



1.4.1 Fine-grained vs. Coarse-grained Models 

Various approaches [61] have been used for studying protein structures as
well as protein folding dynamics. Apart from other differences, these
methods vary in the extent of detail with which the structure is modelled.
Some consider atomic-level details [62, 63], whereas some reduce the
structure to a chain of beads spatially constrained to a rectangular lattice
with a limited number of attainable conformations [64]. The relevance and
applicability of each of these models, of course, rests on the kind of
questions that are asked. While fine-grained models are heavy on resources
(computer memory, time needed, complexity of coding, etc.), they are better
suited for questions that involve aspects of proteins that, from
experimental studies, are known to depend on fine structural details. On the
other hand, coarse-grained models are of special value as they make it
feasible to work with a large and complex system and offer a systems-level
insight.

By virtue of the large number of constituent atoms and the complexity of
chemical interactions amongst them, a protein structure is a system with a
large number of degrees of freedom, rendering it immensely difficult to
model and analyse in detail. Given the diversity in the functional roles of
proteins and the fairly large number of structural units that proteins are
composed of (number of unique folds, as defined by SCOP [65]: 1000 as of
Oct. 2006 [66]), it makes sense to consider coarse-grained models as a
viable option for modelling protein structures. Today, thanks to the
untiring efforts of crystallographers and structural biologists across the
world, and because of advances in techniques and instruments, one has open
access to huge repositories of interesting protein structures such as PDB,
MSD-EBI, PDBJ, and BMRB. As on 15th August 2006, there were a total of 38198
structures deposited in the Protein Data Bank [66]. Also, from earlier
studies [68, 70, 71], it is known that protein structures are amenable to
coarse-graining while being of practical use. These facts strengthen the
case for 'coarse-grained models' vis-a-vis detailed, fine-grained ones.



1.4.2 Range of Interactions in Proteins 

Proteins are characterised by interactions happening at various ranges.
Here, in the context of the linear-chain nature of proteins, range is
defined as the distance between two interacting residues along the
polypeptide. Interactions can then be divided into long- and short-range
interactions. Apart from interactions that take place in the process of
folding, many NMR experiments have shown that even after reaching the native
state, proteins undergo conformational fluctuations with time scales from
several nanoseconds to milliseconds. It has been suggested that such
functionally important fluctuations are triggered by long-range interactions
among a network of residues [72]. Communication happening via such
long-range interactions is central to protein function, and proteins have
evolved specific mechanisms to address this constraint. It has been shown
[73] that information about these mechanisms is embedded in the evolutionary
record of a protein family. In our work, we delineate a range of
interactions to study their individual importance and contribution.



1.4.3 Earlier studies on Protein Contact Networks 

So far many studies have been undertaken to investigate protein structures 
as complex networks of interacting residues. 

In an early study [74], Crippen analysed protein structure in an effort to
offer an objective definition of a protein domain. The author studied the
structural organisation through a binary tree clustering algorithm for the
residues of a single polypeptide chain. It was found that the protein
structure is constituted of a hierarchy of segments that group together into
clusters, which then merge to eventually form the complete chain.

In another similar study, Rose [75] developed an automated procedure for the
identification of domains in globular proteins. Through a slightly different
approach, Rose reached the conclusion that the hierarchic organisation of
structural domains is evidence in favour of an underlying protein folding
process that proceeds by hierarchic condensation.

Aszodi and Taylor [4] modelled linear polypeptides as well as 3-D proteins
as non-directed graphs. They defined two topological indices, one
(connectedness number) as a residue-distance measure and another (effective
chain length) as a foldedness measure, to compare folding topologies. They
could reveal the hierarchical structure in the non-backbone connections of
proteins.

Kannan and Vishveshwara have used the graph spectral method to detect
side-chain clusters in three-dimensional structures of proteins. The
approach they described is used to detect a variety of side-chain clusters
and to identify the residue which makes the largest number of interactions
among the residues forming the cluster. Vishveshwara and others [77, 78, 79]
have conducted many studies with amino-acid networks.

Vendruscolo et al. showed that protein structures have small-world [31]
topology. They studied transition state ensemble (TSE) structures to identify
the key residues that act as "hubs" in the network of interactions that
stabilise the structure of the transition state. They also showed that,
though homopolymers have high clustering comparable to that of proteins,
their betweenness profile is uniform, unlike that of proteins.

Greene and Higman [3] studied the short-range and long-range interaction 
networks in protein structures and showed that long-range interaction net- 
work is not small world and its degree distribution, while having an under- 
lying scale-free behaviour, is dominated by an exponential term indicative of 
a single-scale system. 

Atilgan et al. studied the network properties of the core and surface of
globular protein structures, and established that, regardless of size, the
cores have the same local packing arrangements. They showed that the
connectivity distribution of residues is independent of their spatial
location. They also explained, with an example of the binding of two
proteins, how the small-world topology could be useful in efficient and
effective dissipation of the energy generated upon binding.

Aftabuddin and Kundu [80, 81] have studied protein structures as made of
three classes of amino acids: hydrophobic, hydrophilic and charged. They
found that the average degree of the hydrophobic networks is significantly
larger than that of the hydrophilic and charged networks. They also found
that the all-amino-acid networks and the hydrophobic networks bear the
signature of hierarchy, whereas the hydrophilic and charged networks do not
have any hierarchical signature.

Shakhnovich and others have studied protein conformation networks to
identify the features that commit a protein conformation on the folding
pathway to rapidly descending to the native state. They used a macroscopic
measure of the protein contact network topology, the average graph
connectivity, by constructing graphs that are based on the geometry of
protein conformations. They found that the average connectivity is higher
for conformations with a high folding probability than for those with a high
probability to unfold.



Jung et al. [83] studied protein structures in search of topological
determinants of protein unfolding. They found that a newly introduced
quantity, the impact of edge removal per residue, has a good overall
correlation with protein unfolding rates.

Amitai et al. found that active-site, ligand-binding and evolutionarily
conserved residues typically have high values of closeness, a network
property. What separates this method from others is that it depends solely
on the information in a single protein structure and does not rely on
sequence conservation, comparison to other similar structures, or any prior
knowledge.

In a recent paper, Sol et al. study proteins as systems that have a
permanent flow of information between amino acids. By doing removal
experiments in seven protein families they find that many of the centrally
conserved residues are also important for allosteric communication. They put
these results in perspective in view of network dynamics, topology, and
constraints on the evolution of protein structure and function.

1.5 In this thesis

The aim of this work has been to describe and study the three-dimensional 
native-state structures of proteins of different structural and functional classes 
as complex networks, enumerate the general network parameters, and study 
the relation of these parameters to their structural, functional, and kinetic 
properties at different length scales. 

In our studies, we model protein structures as networks of interacting
residues. We start from the detailed, fine-grained protein structure with
atomic-level details and obtain the Protein Contact Network (PCN) by
coarse-graining. In this process we keep the positional information of the
atoms that represent the amino acids and disregard all other information. A
'contact' between any two residues represents a possible noncovalent
interaction between them. The cut-off threshold (R_c) for deciding a contact
is accordingly chosen as 8 Å. We describe the construction of the PCN model
in Chapter 2. We consider interactions at various length-scales, as
described in Subsection 1.4.2. The Long-range Interaction Network (LIN) is a
subset of the PCN and comprises only the backbone and the long-range
interactions. In the same chapter, we also describe the construction of LINs
and of the different control networks. Further, we explain the various
visualisation schemes that we have used in our studies. Then we define and
illustrate various network parameters and properties. Finally, we present
the data of the proteins that are used in our studies.

In Chapter 3 we describe our results related to the small-world nature of
the PCNs. We find that protein structures of diverse structural and
functional classification display small-world nature. We observe that all
the 80 proteins of different classes have a very high clustering
coefficient. Despite being structurally different from globular proteins,
even fibrous proteins are found to have a small-world signature. We find
that PCNs have a clear signature of hierarchical nature in the clustering
versus size profile.

PCNs are a unique class of macromolecular complex networks characterised
by biological origin and evolutionary pressure. Hence one expects PCNs to
show their unique nature through network properties. In Chapter 4 we find
that PCNs are 'assortative', i.e. rich nodes tend to connect to rich nodes
and poor nodes tend to make contact with each other. This is an exceptional
observation, as it is known that (except for social networks) all other
complex networks are 'disassortative'. We find that LINs, despite their very
different degree distribution, are also assortative. This is an interesting
observation, as it indicates that the short-range interactions possibly do
not contribute towards the observed assortativity. In our study of the role
of various network features in bringing in assortativity, we show that the
degree distribution makes a major contribution towards conferring
assortativity in PCNs as well as in LINs.

In Chapter 5 we investigate biophysical correlates of the topological
parameters of PCNs and LINs. For this study we use 30 single-domain two-state
folding proteins whose rate of folding is known. We find that the exceptional
topological property, assortativity, has a positive correlation with the rate
of folding (ln(k_F)) for both the PCN and the LIN of the proteins. We also
find that the clustering coefficients of LINs have a very good negative
correlation with ln(k_F).

Thus with the help of our coarse-grained, complex network models we analyse 
protein structures and study questions relating to structure, function and 
kinetics. 



Chapter 2 



Materials and Methods 



With the aim of analysing protein structures, we developed various network
models of protein structures, as well as their controls, and defined various
network properties of these models. In this chapter, in Section 2.1 we
describe the models that we have used and the procedure for constructing
them. Wherever required we also mention the algorithms that were used for
this purpose. In Section 2.2 we describe and illustrate a few ways of
visualising the network. Throughout our study we use various network
parameters and properties to characterise the network system under study. In
Section 2.3 we define and describe these parameters. In Section 2.4 we
present the data, along with other relevant information, of the proteins that
we have used in our studies. We mention the details of the programming
languages and the software used in Section 2.5. The pseudocodes of all the
programmes (written in FORTRAN90 & MATLAB) are given in the Appendix.

2.1 Construction of Protein Contact Networks and Controls

In our studies, we have used graph theory to model protein structures.
Graphs, in general, can be used to model various kinds of systems in which
nodes (vertices) represent discrete network elements and links (edges)
represent a well-defined relationship between any two nodes. Below we explain
two coarse-grained models of protein structure, and their controls, that
were used in our studies.

Figure 2.1: Two representations of Acyltransferase (2PDD): (a) ball-and-stick
representation, with atom colours as specified by RasMol's 'CPK colour
scheme': hydrogen (white), carbon (gray), oxygen (red), nitrogen (light
blue), and sulphur (yellow); (b) the backbone.

2.1.1 Protein Contact Network (PCN) 

We modelled the native-state protein structure as a network made of its
constituent amino acids and their noncovalent interactions. The Protein
Contact Network (PCN) is a graph-theoretical representation of the protein
structure, where each amino acid is a 'node' and spatial proximity of any two
amino acids is a 'link' between them. Any two amino acids were considered to
be in 'spatial contact' if the distance between their Cα atoms was less than
or equal to a cut-off of R_c = 8 Å. The choice of R_c was based on the range
at which the noncovalent interactions, which are responsible for the
polypeptide chain folding into its native state, are effective.

A point to note is that, apart from the noncovalent interactions, we
considered the covalent peptide bonds between consecutive amino acids as
links, thus representing the backbone of the protein. This chain of
backbone links was left unaltered while creating the controls, thus
reflecting an important aspect of protein folding dynamics: throughout the
folding process, the peptide backbone is unbroken and the protein goes
through structural changes by making and breaking noncovalent contacts.

Contact Map

The Contact Map (CM) [85, 86, 87, 88, 89, 90] is a 2-D, binary, symmetric
representation of the protein structure in terms of pair-wise, inter-residue
contacts. Any two residues are defined to be in 'contact' with each other if
the Cα atoms of these two residues are within a cut-off distance (R_c). Thus
the contact map is a coarse-grained representation of the 3-D structure of a
protein. A contact map (M) for a protein with n_r residues is a matrix M of
the order n_r × n_r whose elements are defined as

\[ M_{ij} = \begin{cases} 1 & \text{if residues } i \text{ and } j \text{ are in contact} \\ 0 & \text{otherwise} \end{cases} \qquad (2.1) \]
The Choice of Cut-off Distance (R_c)

The cut-off distance was chosen based on the chemical interactions that are
responsible for folding, unfolding, stability, function, etc. The chemistry
of these processes is primarily dictated by the chemistry of noncovalent
interactions, viz. van der Waals interactions, hydrogen bonds, ionic bonds,
and hydrophobic interactions. The cut-off threshold could be varied from a
very high, fine-grained resolution (say, R_c ≈ 4 Å) to a very low,
coarse-grained resolution. There is a lower as well as an upper limit to the
cut-off. A value of R_c that is less than the resolution of the protein model
does not make sense, and a threshold larger than the size of the protein is,
again, meaningless. For our purpose, to retain the meaningful information
specified by the noncovalent interactions while at the same time not being
bogged down by atomic-level details, a threshold of R_c = 8 Å is an ideal
choice.

In our studies we have used 7 Å or 8 Å as the cut-off threshold depending
on the data-set, though the results are valid for a range of thresholds
between (at least) 7 and 9 Å. For practical purposes the threshold should be
considered to be R_c = 8 Å throughout our studies. Various cut-offs, ranging
from 5 Å through 7 Å to 8.5 Å, have been used in earlier studies.

Figure 2.2: The Protein Data Bank (PDB) file containing the atomic
coordinates of 2PDD (Acyltransferase).

Computational Procedure

The information required for building models of protein structures was
extracted from the PDB (Protein Data Bank; http://www.rcsb.org/pdb/) file of
each protein. The PDB file contains a large amount of structural detail
obtained by X-ray diffraction or NMR methods. We explain the methodology with
the example of the protein Acyltransferase (2PDD), shown in Fig. 2.1.
Figure 2.2 shows the 'Model' section of the PDB file in which, apart from
other details, the atom number, atom label, amino acid type, amino acid
number, and coordinates are shown. The amino acids are labelled in increasing
order from the N-terminal to the C-terminal residue, starting from 1 up to
n_r, the total number of residues (1 to 43 for 2PDD). First, we extracted
the three-dimensional coordinates of the Cα atoms (CA in Fig. 2.2), the
structural representatives of amino acids in the network models. Next, we
calculated the Cartesian distances between all pairs of Cα atoms of the
residues. Using a threshold R_c (as described earlier), we computed the
'Contact Map (M)'. Fig. 2.3(a) shows the pairwise distance (in Å) matrix for
Acyltransferase (2PDD) and (b) the corresponding Contact Map with distance
threshold R_c = 8 Å.
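
The procedure just described can be sketched in a few lines of Python. This
is only an illustrative sketch and not the FORTRAN90/MATLAB programmes of the
thesis: the function names are hypothetical, the input is assumed to be the
text of a single-model, single-chain PDB file in the standard fixed-column
format, and alternate atom locations are not handled.

```python
import math


def parse_ca_coordinates(pdb_text):
    """Return the (x, y, z) of every C-alpha atom, in file order.

    Assumes fixed-column PDB ATOM records: atom name in columns 13-16,
    coordinates in columns 31-38, 39-46 and 47-54.
    """
    coords = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            coords.append(
                (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            )
    return coords


def contact_map(coords, r_c=8.0):
    """Binary, symmetric contact map: M[i][j] = 1 if distance <= r_c (in Å)."""
    n = len(coords)
    m = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):  # diagonal (self-contacts) stays zero
            if math.dist(coords[i], coords[j]) <= r_c:
                m[i][j] = m[j][i] = 1
    return m
```

With an 8 Å cut-off, consecutive residues (Cα atoms roughly 3.8 Å apart) are
always in contact, so the backbone links appear in the map automatically.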




Figure 2.3: 'Pair-wise distance matrix' of all Cα atoms from Fig. 2.2 and
the Contact Map after thresholding with a cut-off of 8 Å.

The Contact Map then serves as the adjacency matrix for drawing the nodes
and links of the contact network. Coarse-graining is inherent in the process
of construction of a PCN. Figure 2.4 summarises the process of coarse-graining
involved in the making of a PCN. The PCN is created by ignoring a large
amount of the positional information of atoms in the X-ray data. Starting
from atomic-level details (Fig. 2.4(a)), we jettison a large amount of
structural detail and go through residue-level details to finally arrive at
the two-dimensional Contact Map (Fig. 2.4(b)). The protein contact network
(PCN) can be reconstructed given the coordinates of (the Cα atoms of) the
residues in the structure to which the Contact Map corresponds
(Fig. 2.4(c)).

Figure 2.4: Coarse-graining of the protein structure data: (a) ball-and-stick
model of 2PDD, (b) its Contact Map, and (c) the PCN with R_c = 8 Å, where
the backbone contacts are shown with a blue line and the rest of the
noncovalent contacts are shown in gray.

2.1.2 Long-range Interaction Network (LIN) 

The Long-range Interaction Network (LIN) of a PCN was obtained by
considering, other than the backbone links, only those 'contacts' which occur
between amino acids that are 'distant' from each other, i.e. residue pairs
that are separated along the backbone by a threshold, termed LRI_threshold,
of 12 or more amino acids. Here, LRI_threshold stands for the Long-range
Interaction threshold, measured in terms of the number of residues along the
backbone, that is used to decide the range up to which the 'long-range
effects' take place. Thus formed, a LIN is a subset of its PCN with the same
number of nodes (n_r) but fewer links (contacts), owing to the removal of
short-range contacts. Fig. 2.5 shows the PCN of 2PDD and its LIN.
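
As a hedged illustration, the LIN can be derived from a contact map as
sketched below. The function name and the list-of-lists matrix layout are
assumptions made for this example; the backbone links are kept
unconditionally, following the description above.

```python
def long_range_network(m, lri_threshold=12):
    """Return the LIN of contact map m (a square list of 0/1 lists).

    Keeps the backbone links (|i - j| == 1) and those contacts whose
    separation along the chain is at least lri_threshold residues.
    """
    n = len(m)
    lin = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            separation = abs(i - j)
            if separation == 1 or (separation >= lri_threshold and m[i][j]):
                lin[i][j] = 1
    return lin
```

The result has the same number of nodes as the PCN and never more links,
since only short-range noncovalent contacts are removed.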

This network is of special significance in the context of a linear chain
(1-D network) model that has additional long-range links between nodes that
are separated along the chain. A protein is one such network system, in
which there is an inherent 1-D structure in terms of the polypeptide
backbone held together by covalent peptide bonds. The polypeptide chain
folds onto itself by virtue of the chemical forces acting among the
constituent residues, thereby creating 'contacts' on various scales as
specified by the separation distance between the contacting residues.

Figure 2.5: (a) PCN and (b) its LIN.

2.1.3 Random Controls of PCN

We created two types of random networks as controls for the PCNs. The
polypeptide backbone connectivity was kept intact in both the random
controls, while the noncovalent contacts were randomised. For every protein,
100 instances of each type of random control were generated. The average
over all the instances was used as the representative value of the
parameters and properties that were compared with those of the PCNs and
their LINs.

Type I Random Control

The Type I random control has the same number of residues (n_r) as well as
number of contacts (n_c) as the PCN, except that the contacts are created
randomly, avoiding self-contacts and duplicate contacts. The connectivity
distribution of the Type I random controls is, in general, not the same as
that of the PCNs. The algorithmic steps used for creating the Type I random
controls were as follows. We started the network with n_r nodes and n_r − 1
covalent contacts representing the backbone. The covalent contacts were put
in place by sequentially connecting residues 1 to 2 to 3, and so on up to
n_r. Further, we added all the noncovalent contacts in a random manner.
First we chose two distinct residues using a uniform pseudo-random number
generator. A noncovalent contact was created between these residues provided
the pair was not a backbone contact and was not already connected. This
process was repeated until the total number of contacts in the random
control was the same as that in the PCN. The 'LINs of Type I random
controls' were obtained in the same fashion as the LINs were obtained from
PCNs. Figure 2.6 shows a typical Type I random control (b) of 2PDD's PCN (a)
and its LIN (c).
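
The steps above can be sketched as follows. This is a minimal illustration
under stated assumptions: residues are numbered from 0, n_c counts the
backbone links as well as the noncovalent contacts, and the function name and
the seed argument are hypothetical conveniences, not part of the original
programmes.

```python
import random


def type1_random_control(n_r, n_c, seed=None):
    """Random network with n_r nodes, an intact backbone and n_c contacts.

    Returns a set of (i, j) pairs with i < j.
    """
    max_contacts = n_r * (n_r - 1) // 2
    if not (n_r - 1 <= n_c <= max_contacts):
        raise ValueError("n_c must lie between n_r - 1 and n_r(n_r - 1)/2")
    rng = random.Random(seed)
    # Backbone: residues connected sequentially, 0-1, 1-2, and so on.
    contacts = {(i, i + 1) for i in range(n_r - 1)}
    while len(contacts) < n_c:
        # Two distinct residues, so self-contacts cannot occur.
        i, j = rng.sample(range(n_r), 2)
        # Adding to a set silently skips backbone and duplicate contacts.
        contacts.add((min(i, j), max(i, j)))
    return contacts
```

Calling the function 100 times with different seeds would give the ensemble
of instances over which the control parameters are averaged.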




Figure 2.6: (a) PCN, (b) Type-I Random Control and (c) its LIN. 
Type II Random Control 




Figure 2.7: (a) PCN, (b) Type-II Random Control and (c) its LIN. 

In Type II random controls, apart from maintaining the number of nodes
(n_r) and contacts (n_c), the connectivity distribution as well as the
individual connectivity of the PCNs was also conserved. We started with the
original PCN and then the noncovalent contacts were randomised while
maintaining the degree of individual nodes. To ensure adequate randomisation
of the connectivity, the pattern of pair-connectivity was randomised 2000
times. In Type II random controls, the degree distributions of only the PCNs
were conserved. For the LINs obtained from these controls of PCNs, the
degree distributions were not explicitly conserved in the randomisation
procedure. Figure 2.7 shows the PCN (a), its typical Type II random control
(b), and its LIN (c).
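
One common way to randomise contacts while conserving every node's degree is
pairwise link swapping, sketched below. Whether the original programmes used
exactly this scheme is an assumption; the function name, the interpretation
of "2000 times" as 2000 successful swaps, and the attempt cap are all
illustrative choices.

```python
import random


def type2_random_control(contacts, n_swaps=2000, seed=None):
    """Degree-preserving randomisation of the noncovalent contacts.

    contacts: iterable of (i, j) residue pairs, backbone links included.
    Backbone links (|i - j| == 1) are never rewired.
    """
    rng = random.Random(seed)
    current = {(min(i, j), max(i, j)) for i, j in contacts}
    free = sorted(e for e in current if e[1] - e[0] != 1)
    if len(free) < 2:
        return current
    done = attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        a, b = rng.sample(range(len(free)), 2)
        (i, j), (k, l) = free[a], free[b]
        if rng.random() < 0.5:  # pick one of the two possible re-pairings
            k, l = l, k
        new1 = (min(i, l), max(i, l))
        new2 = (min(k, j), max(k, j))
        # Reject swaps that would create self-contacts or duplicate contacts.
        if i == l or k == j or new1 in current or new2 in current:
            continue
        current -= {free[a], free[b]}
        current |= {new1, new2}
        free[a], free[b] = new1, new2
        done += 1
    return current
```

Each accepted swap replaces contacts (i, j) and (k, l) by (i, l) and (k, j),
so every residue keeps its degree while its partners change.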

2.2 Network Visualisation

There are various ways in which a graph (network) can be visualised.
Depending on the purpose of the visualisation, one may want to choose an
appropriate visualisation method. Network systems can be classified based on
the way the nodes are, if at all, positionally related to each other. A
network may have (a) no positional relationship among its nodes, (b) a
linear relationship (a 1-D chain), (c) a 2-D order, or (d) nodes that are
each characterised by positional coordinates in 3-D space.

When studying a system in which there is no structural order of the nodes, a 
2-D or 3-D visualisation with arbitrary node positions optimised for minimal 
crossings of the links is one of the suitable choices. For a system in which a 
linear order is specified, a chain-like or ring-like representation would capture 
the necessary details. A system with 2-D (3-D) order could be represented 
in the 2-D (3-D) space with appropriate positions of nodes and, if necessary, 
with optimised edge-crossings. 

Contact Map Visualisation 

The Contact Map has been defined and used earlier. Here we mention its
visualisation aspects and its relationship to the protein that it models.
The principal diagonal of the contact map corresponds to the self-contacts,
which by definition are zero: M_ii = 0. The positions parallel and next to
the diagonal correspond to a contact separation of |j − i| = 1, which is
equivalent to the polypeptide chain that is held together by the covalent
peptide bonds. The elements diagonally parallel and next to the backbone
represent contacts between residues which are one residue apart
(|j − i| = 2) along the backbone. This continues until one reaches the
single contact possible with |j − i| = n_r − 1 which, when present in a
protein, indicates a contact between the N-terminal and C-terminal residues
of the protein. This understanding can be used for creating appropriate
models with desired types of chemical connectivities.

Chain Representation

The information content of the protein's contact map can be transformed into
a representation that offers similar insight about the range at which
contacts occur along the protein backbone. It also gives a hint about the
locations where the secondary structures occur. This 'Chain Representation'
of the protein structure, though similar to the Contact Map, is sometimes
more useful as it represents the protein structure in a less abstract and
more easily accessible fashion. Fig. 2.8 depicts, for Acyltransferase
(2PDD), the parts that the Chain Representation is composed of.

Figure 2.8(a) shows the contact map of the network with short-range contacts
(|j − i| < 12) in 'blue'. Fig. 2.8(b) shows the long-range contacts
(|j − i| ≥ 12) in 'yellow'. Finally, Fig. 2.8(c) shows the 'Chain Contact
Map', which is a combination of the above two. The polypeptide backbone of
Acyltransferase (2PDD) is aligned as a chain of residues along a circular
curve. The residues are labelled in increasing order (anti-clockwise) from
the N-terminal to the C-terminal. The backbone contacts, which trace the
circle, are shown in black. The short-range contacts
(|j − i| < LRI_threshold) are shown in blue, and the long-range contacts are
shown in red.

3-D Representation 

As mentioned earlier, the positional information of the residues in the
protein's 3-D structure is lost in the contact map as well as in the chain
representation. Owing to the relevance of the positional information, PCNs
can be better visualised in 3-D space. This is achieved by superimposing the
'positional information' on that in the 'contact map', as shown in
Fig. 2.4(c).

2.3 Network Parameters and Properties 

Various features of a network's topology and dynamics can be measured by
defining parameters that capture the appropriate aspects of it. Below, we
describe properties that are typically used to characterise a network. Since
a network could be directed or undirected, and weighted or unweighted, the
parameters need to be appropriately defined. The following definitions are
valid for any undirected and unweighted network.

Here, we explain the implication of each of these parameters in the context
of the network model that we build for the protein structures.

Figure 2.8: Drawing the 'Chain Contact Map'. Contact Map of the PCN with (a)
short-range contacts, and (b) long-range contacts highlighted. (c) 'Circle'
or 'Chain Contact Map'.

2.3.1 Distance Measures 

Here, distance is measured in terms of the number of edges that need to be
traversed to reach one node from another. Many distance measures can then be
defined which capture different aspects of the protein structure.



Characteristic Path Length 

Shortest path length (L_ij) between any pair of nodes i and j is defined as
the number of links that must be traversed, by the shortest route, from one
node to the other. The average of the shortest path lengths, known as the
'characteristic path length' (L), is an indicator of the compactness of the
network, and is defined [30] as

\[ L = \frac{1}{n_r (n_r - 1)} \sum_{i \neq j} L_{ij} \qquad (2.2) \]

where n_r is the number of residues in the network.

This definition is illustrated in Fig. 2.9(a). The figure shows two of all
the possible 'paths' between nodes 31 and 20, both of which are close to
'the shortest path'. Shown with blue-coloured arrows is the path
31 → 32 → 33 → 34 → 35 → 20, with a path length of 5, whereas the path with
red-coloured arrows, 31 → 7 → 8 → 9 → 10 → 11 → 20, has a path length of 6.
Hence the shortest path length between nodes 31 and 20 is L_{31,20} = 5.

Fig. 2.9(b) shows the shortest paths distribution for the example protein
network, 2PDD. Analytically, for a regular network (a ring lattice in which
every node has the same degree) with n_r nodes and average degree ⟨k⟩, L is
given by

\[ L = \frac{n_r (n_r + \langle k \rangle - 2)}{2 \langle k \rangle (n_r - 1)} \]

Figure 2.9: (a) Illustration of the shortest path length (L_ij), (b)
Shortest Paths Distribution for 2PDD.



Diameter 



Another measure of the compactness of the network is the Diameter (D),
which is defined as the largest of all the shortest path lengths in the
network, taken over all i-j pairs:

\[ D = \max_{i \neq j} L_{ij} \qquad (2.3) \]
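
For an unweighted network, both distance measures can be computed with a
breadth-first search from every node. The sketch below is illustrative only;
it assumes the network is connected (always true for a PCN, because of the
backbone), has at least two nodes, and is supplied as a dictionary mapping
each node to its neighbours. The function names are hypothetical.

```python
from collections import deque


def shortest_path_lengths(adj, source):
    """Number of links on the shortest route from source to every node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def path_length_and_diameter(adj):
    """Return (L, D): the mean and the maximum of all shortest path lengths."""
    n = len(adj)
    total = 0
    diameter = 0
    for source in adj:
        dist = shortest_path_lengths(adj, source)
        if len(dist) != n:
            raise ValueError("network is not connected")
        total += sum(dist.values())
        diameter = max(diameter, max(dist.values()))
    # total runs over ordered pairs (i, j), matching the n_r(n_r - 1) factor.
    return total / (n * (n - 1)), diameter
```

For a ring of 10 nodes in which every node has degree 2, this gives
L = 25/9, which agrees with the analytical expression for a regular network.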

Figure 2.10: (a) Degree (k_i) and (b) Closeness.

2.3.2 Centrality Measures

A network representation of a complex system, by definition, embeds the
complex interactions among the various elements of the system. Despite the
distributed nature of the elements, some of them could potentially hold a
'central' position in the network, thereby being crucial to the topology
and/or the dynamics. Centrality refers to the structural attribute of nodes
in the network and not to the attribute of the node itself.

Degree and Average Degree 

Degree (k_i) of a node i is the total number of neighbours (linked nodes) it
has.

Thus defined, degree captures the centrality of a node in terms of its
connectivity. The higher the degree, the better connected the node is. In a
PCN, the degree (k_i) measures the number of other amino acids that amino
acid i is spatially proximal to (for the given R_c) in the native-state
protein structure. Figure 2.10(a) shows the degrees of the individual nodes
of 2PDD.

It may be noted that as compared to other biological as well as technologi- 
cal networks, the process of formation and the constraints which shape the 
network structure are very different for the PCNs. Owing to the covalent 
backbone connections and steric and space constraints, the typical degree in 
PCN is much lower than that found in other networks. 

Average degree, ⟨k⟩, of a network with n_r nodes is defined as

\[ \langle k \rangle = \frac{1}{n_r} \sum_{i=1}^{n_r} k_i \]

Closeness 

Closeness is defined based on the measurements of shortest path length be- 
tween pairs of amino acids. It is a measure that computes the average con- 
nectivity of a residue with the rest of the network. It integrates the effect 
of the entire protein, measured in terms of its shortest distance from every 
other node, on a single residue. It is defined as. 




(2.4) 




(2.5) 
Figure 2.10(b) shows the closeness values of the individual residues of
2PDD. Any property of an amino acid that is dependent on the average
connectivity of amino acids could potentially be related to closeness. It is
known that the kinetics and stability of a protein are often dependent on
the chemical properties of one or a few amino acids. Given this fact, the
property of closeness acquires a special meaning and could possibly be used
to explore the functional relevance of individual amino acids.
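
The two centrality measures can be computed directly from the contact map,
as in the sketch below. The closeness formula used here, (n_r − 1) divided
by the sum of the shortest path lengths from a residue to all others, is a
commonly used definition and is an assumption of this example, not
necessarily the exact form used in this work. The function names are
hypothetical, and a connected network with at least two nodes is assumed.

```python
from collections import deque


def degrees(m):
    """Degree k_i of every node of contact map m (a list of 0/1 lists)."""
    return [sum(row) for row in m]


def closeness(m):
    """Assumed closeness: (n_r - 1) / sum of shortest path lengths."""
    n = len(m)
    values = []
    for source in range(n):
        dist = [-1] * n  # -1 marks a node not reached yet
        dist[source] = 0
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if m[u][v] and dist[v] < 0:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if min(dist) < 0:
            raise ValueError("network is not connected")
        values.append((n - 1) / sum(dist))
    return values
```

With this definition a residue that is one link away from every other
residue has the maximum closeness of 1.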



2.3.3 'Pattern of Connectivity' Measures 

The parameters defined so far characterise individual nodes or pairs of
nodes, measuring their distances or centrality in the network. On a level
above this, the network is shaped by the pattern of connectivities among
nodes. This pattern can be characterised in the following ways.



Degree Distribution 

Degree symbolises the importance of a node from the perspective of mere
connectivity: the larger the degree, the more important the node is. The
distribution of degrees in a network is an important feature which
characterises the topology of the network. It could possibly reflect on the
processes by which the network has evolved to attain its present topology.
Networks in which the links between any two nodes are assigned randomly have
a Poisson degree distribution [91], with most of the nodes having similar
degree. Fig. 2.11(a) shows the degree distribution pattern of 2PDD.

Normalised Degree Distribution is the degree distribution normalised with
Freq(max), the maximum frequency of the distribution. Henceforth P(k)
denotes the normalised degree distribution. P(k) allows one to compare
networks with disparate degree distribution profiles.

Remaining degree is simply one less than the total degree of a node. If p_k
is the distribution of the degrees, then the normalised distribution, q_k,
of the remaining degree is

\[ q_k = \frac{(k + 1)\, p_{k+1}}{\sum_j j\, p_j} \]

Coefficient of Assortativity

A network is said to show assortative mixing, or simply to be 'assortative',
if the high-degree nodes in the network tend to be connected with other
high-degree nodes. On the other hand, the network is said to be
'disassortative' if the high-degree nodes tend to be connected with
low-degree nodes. The Coefficient of Assortativity (r) measures the tendency
of degree correlation. It is the Pearson correlation coefficient of the
degrees at either end of a link and is defined [28] as

\[ r = \frac{1}{\sigma_q^2} \sum_{jk} jk \left( e_{jk} - q_j q_k \right) \qquad (2.7) \]

where r is the coefficient of assortativity, j and k are the degrees of
nodes, q_j and q_k are the remaining degree distributions, e_jk is the joint
probability distribution of the remaining degrees of the two nodes at either
end of a randomly chosen link, and σ_q² is the variance of the distribution
q_k. r is a normalised degree correlation function, a global quantitative
measure of degree correlations in a network, and takes values in the range
−1 ≤ r ≤ 1. The value of r is zero for no specific trend in degree
correlations, and positive or negative for assortative or disassortative
mixing, respectively.
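
Since r is a Pearson correlation over the two ends of every link, it can be
computed directly from a list of links without building q_k or e_jk
explicitly. The sketch below uses the full degrees rather than the remaining
degrees, which leaves a Pearson coefficient unchanged because it is
invariant to a constant shift. The function name and the input format (a
list of (i, j) pairs, each link listed once) are assumptions of this
example.

```python
def assortativity(links):
    """Pearson correlation of the degrees at either end of a link."""
    links = list(links)
    degree = {}
    for i, j in links:
        degree[i] = degree.get(i, 0) + 1
        degree[j] = degree.get(j, 0) + 1
    n_links = len(links)
    # Averages over links; each link contributes both of its ends.
    mean_product = sum(degree[i] * degree[j] for i, j in links) / n_links
    mean_end = sum(0.5 * (degree[i] + degree[j]) for i, j in links) / n_links
    mean_square = sum(
        0.5 * (degree[i] ** 2 + degree[j] ** 2) for i, j in links
    ) / n_links
    variance = mean_square - mean_end ** 2
    if variance == 0:
        raise ValueError("all link ends have the same degree; r is undefined")
    return (mean_product - mean_end ** 2) / variance
```

A star-shaped network, in which one hub is linked to nodes of degree one,
gives r = −1, the extreme disassortative case.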
Figure 2.11: Topological Properties: (a) Degree Distribution (P(k)) and (b)
Degree Correlations (⟨k_nn(k)⟩).

Degree Correlations

Another way to assess the degree correlation pattern in a network is to
visualise it by measuring the average degree of the nearest neighbours,
k_nn(k), for nodes of degree k. In the presence of correlation, k_nn(k)
increases with increasing k for an 'assortative network', whereas it
decreases with k for a 'disassortative network'. Fig. 2.11(b) shows the
degree correlations pattern of 2PDD.
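
A minimal sketch of how k_nn(k) can be tabulated is given below, assuming
the network is supplied as a dictionary mapping each node to a list of its
neighbours. The function name is hypothetical.

```python
def average_neighbour_degree_by_degree(adj):
    """Return {k: k_nn(k)}, averaging over all nodes that have degree k."""
    degree = {node: len(neighbours) for node, neighbours in adj.items()}
    sums = {}
    counts = {}
    for node, neighbours in adj.items():
        k = degree[node]
        if k == 0:
            continue  # an isolated node has no neighbours to average over
        k_nn = sum(degree[v] for v in neighbours) / k
        sums[k] = sums.get(k, 0.0) + k_nn
        counts[k] = counts.get(k, 0) + 1
    return {k: sums[k] / counts[k] for k in sorted(sums)}
```

Plotting the returned values against k gives a rising trend for an
assortative network and a falling trend for a disassortative one.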



2.3.4 Compactness Measures 
Clustering Coefficient 



Clustering coefficient of a node i, C_i, is defined [30] as
C_i = 2n / (k_i (k_i − 1)), where n denotes the number of contacts amongst
the k_i neighbours of node i. C_i is equal to 1 for a node whose neighbours
are fully interlinked, and zero if none of the neighbouring nodes share any
contacts. The average clustering coefficient of the network (C) is defined
as the average of the C_i of all the nodes in the network and will be
referred to as the 'clustering coefficient' unless specified otherwise. The
clustering coefficient is a measure of the cliquishness of the network.

Numerically, the clustering coefficient is computed as follows using the contact
map:

    C_i = \frac{(M^3)_{ii}}{k_i (k_i - 1)},    k_i = \sum_j M_{ij},

where M is the symmetric, binary adjacency matrix representation of the
network.

Analytically, the C for a regular network of average degree
$\langle k \rangle$ is given by

    C = \frac{3 (\langle k \rangle - 2)}{4 (\langle k \rangle - 1)}

Fig. 2.12 illustrates the definition of $C_i$. The figure shows a network with
43 nodes, of which nodes 29 and 11 are highlighted. With the given definition of
$C_i$, we find that $C_{29} = 2/{}^{3}C_{2} = 0.66$, and that for node 11 is
$C_{11} = 0$. Obviously, $C_i$ is 'not defined' for isolated nodes (k = 0),
and is 0 for nodes with degree 1.
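
A short sketch of how $C_i$ and C could be computed from a binary contact map is
given below. It is an illustration under stated assumptions rather than the
code used in this study: the function names are hypothetical, isolated nodes
are reported as `None` (not defined), and degree-1 nodes are assigned zero by
convention. The last function evaluates the analytical expression for a regular
network quoted above.

```python
def clustering_coefficients(adj):
    """Return C_i for every node of a symmetric 0/1 adjacency matrix."""
    n = len(adj)
    coeffs = []
    for i in range(n):
        nbrs = [j for j in range(n) if j != i and adj[i][j]]
        k = len(nbrs)
        if k == 0:
            coeffs.append(None)   # isolated node: C_i is not defined
        elif k == 1:
            coeffs.append(0.0)    # convention assumed for degree-1 nodes
        else:
            # Count the contacts among the k neighbours of node i.
            links = sum(adj[a][b]
                        for pos, a in enumerate(nbrs)
                        for b in nbrs[pos + 1:])
            coeffs.append(2.0 * links / (k * (k - 1)))
    return coeffs


def average_clustering(adj):
    """Average of C_i over all nodes for which it is defined."""
    values = [c for c in clustering_coefficients(adj) if c is not None]
    return sum(values) / len(values)


def regular_network_clustering(mean_degree):
    """Analytical C = 3(<k> - 2) / (4(<k> - 1)) for a regular network."""
    return 3.0 * (mean_degree - 2.0) / (4.0 * (mean_degree - 1.0))


# A triangle (nodes 0, 1, 2) with one pendant node (3) attached to node 0.
contact_map = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
print(clustering_coefficients(contact_map))
print(average_clustering(contact_map))
```

Node 0 has three neighbours of which only one pair is in contact, so its
coefficient is 2 x 1 / (3 x 2) = 1/3, in the same way as the worked example for
node 29.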

Figure 2.13 shows the $C_i$ of the individual residues of 2PDD and their
histogram.

Figure 2.12: Illustration of the Clustering Coefficient ($C_i$).

Figure 2.13: Clustering Coefficient ($C_i$) and its distribution ($P(C)$)

2.4 Data

For most of the studies (Chapters 3 and 4) we analysed a total of 80 proteins
belonging to different functional categories. Of these 80 proteins, we had 20
each from the α (Table 2.1), β (Table 2.2), α/β (Table 2.3), and α + β
(Table 2.4) structural classes. Here, we followed the SCOP [65] classification
of proteins. The α and β classes consist of proteins that are made of
α-helices and β-sheets, respectively. The α/β class consists of mainly parallel
β-sheets (β-α-β units). The α + β class consists of mainly anti-parallel
β-sheets (segregated α and β regions).

For Chapter 5 we considered only small globular proteins. These were 30
single-domain, two-state folding proteins (Tables 2.5 and 2.6). The following
tables list each protein with its name and other classification details.

Table 2.1: Data table for 20 proteins of α structural class.

Table 2.2: Data table for 20 proteins of β structural class.

PDB ID | Name | Functional Class | Size ($n_r$) | No. of Chains | Resoln. (of X-ray)
… | Hydroxymethyltransferase | Transferase | 1240 | 5 | 2.80 Å
1QO2 | Ribonucleotide Isomerase | Isomerase | 182 | 2 | 1.85 Å
1QTW | DNA Repair Enzyme Endonuclease IV | Hydrolase | 285 | 1 | 1.02 Å
1YLV | COMPLEX | Lyase | 341 | 1 | 2.15 Å
2TPS | Thiamin Phosphate Synthase | Thiamin Biosynthesis | 452 | 2 | 1.25 Å
8RUC | Spinach Rubisco Complex | Lyase (Carbon-Carbon) | 2359 | 8 | 1.50 Å

Table 2.3: Data table for 20 proteins of α/β structural class.



PDB ID | Name | Functional Class | Size ($n_r$) | No. of Chains | Resoln. (of X-ray)
1ALC | Baboon Alpha-Lactalbumin | Calcium Binding Protein | 122 | 1 | 1.70 Å
1AVP | Human Adenovirus 2 Proteinase | Hydrolase | 215 | 2 | 2.60 Å
1BRN | Barnase | Endonuclease | 216 | 2 | 1.76 Å
1CNS | Chitinase | Anti-Fungal Protein | 486 | 2 | 1.91 Å
1CQD | Cysteine Protease | Hydrolase | 864 | 4 | 2.10 Å
1EUV | ULP1 Protease Domain | Hydrolase | 300 | 2 | 1.60 Å
[illegible] | Human Cellular Coagulation Factor XIII | Coagulation Factor | [illegible] | 2 | 2.10 Å
1GCB | DNA-Binding Protease | DNA-Binding Protein | 452 | 1 | 2.20 Å
1GOU | Ribonuclease Binase | Hydrolase | 218 | 2 | 1.65 Å
1IWD | A Plant Cysteine Protease Ervatamin | Hydrolase | 215 | 1 | 1.63 Å
1K3B | Human Dipeptidyl Peptidase I | Hydrolase | 352 | 3 | 2.15 Å
1LNI | A Ribonuclease | Hydrolase | 219 | 2 | 1.00 Å
1LSD | Lysozyme | Hydrolase | 129 | 1 | 1.70 Å
1ME4 | COMPLEX | Hydrolase | 204 | 1 | 1.10 Å
1MZ8 | COMPLEX | Toxin Hydrolase | 435 | 4 | 2.00 Å
1PPN | Papain Cys-25 | Hydrolase | 212 | 1 | 1.60 Å
1QMY | FMDV Leader Protease | Hydrolase | 468 | 3 | 1.90 Å
1QSA | Lytic Transglycosylase | Transferase | 618 | 1 | 1.65 Å
1UCH | Deubiquitinating Enzyme UCH-L3 | Cysteine Protease | 206 | 1 | 1.80 Å
2ACT | Actinidin | Hydrolase (Protease) | 218 | 1 | 1.70 Å

Table 2.4: Data table for 20 proteins of α + β structural class.



PDB ID | Size ($n_r$) | Name
1HRC | 104 | Horse heart cytochrome C
1IMQ | 86 | Colicin E9 immunity protein IM9
1YCC | 108 | Yeast iso-1-cytochrome C
2ABD | 86 | Acyl-coenzyme A binding protein from bovine liver
2PDD | 43 | Acetyltransferase
1APS | 98 | Acylphosphatase
1CIS | 66 | Chymotrypsin inhibitor 2 and Helix E
1COA | 64 | The hydrophobic core of chymotrypsin inhibitor 2
1FKB | 107 | Rapamycin human immunophilin FKBP-12 complex
1HDN | 85 | Phosphocarrier protein HPr from E. coli
1PBA | 81 | Activation domain of porcine procarboxypeptidase B
1UBQ | 76 | Ubiquitin
1URN | 96 | U1A mutant/RNA complex + glycerol
1VIK | 99 | HIV-1 protease
2HQI | 72 | Oxidized form of MerP
2PTL | 78 | Immunoglobulin light chain-binding domain of protein L
2VIK | 126 | Actin-severing domain villin 14T

Table 2.5: Data table for single-domain, two-state folding proteins, belonging to α and β class.



PDB ID | Size ($n_r$) | Name
1AEY | 58 | Alpha-spectrin SRC homology 3 domain
1CSP | 67 | Bacillus subtilis major cold shock protein
1MJC | 69 | The major cold shock protein of E. coli
1NYF | 58 | SH3 domain from fyn proto-oncogene tyrosine kinase
1PKS | 76 | The PI3K SH3 domain
1SHF | 59 | The SH3 domain in Human FYN
1SHG | 57 | SRC-homology 3 (SH3) domain
1SRL | 56 | The SRC SH3 domain
1TEN | 89 | Fibronectin Type III domain from tenascin
1TIT | 89 | Titin, IG repeat 27
1WIT | 93 | Twitchin immunoglobulin superfamily domain
2AIT | 74 | Alpha-amylase inhibitor tendamistat
3MEF | 69 | Major cold-shock protein from Escherichia coli

Table 2.6: Data table for single-domain, two-state folding proteins, belonging to αβ class.

2.5 Software used

The following is the list of software (programming languages and software
utilities) used in various parts of the study.

• FORTRAN90

Fortran90 was used to program most of the algorithms needed for the
network analyses.

• MATLAB

MATLAB was used primarily for visualisation, although extensive programming
was needed to create the intricate and detailed graphics that complement
the analyses. URL: www.mathworks.com

• Gnuplot

Since most of the work was done on the Linux platform, Gnuplot was the main
plotting tool. Extensive coding was done to automate the generation of
graphs on a large scale. URL: www.gnuplot.info

• Graphviz

The Fortran90 code was programmed to generate standard Graphviz input
files. The files so generated were fine-tuned while laying out the graphs
in Graphviz. URL: www.graphviz.org

• PERL

It was used primarily for extracting the required data from the PDB
files. URL: www.perl.com

• Octave

Octave was used as a replacement for MATLAB whenever required on
the Linux platform. URL: www.octave.org

• Pajek

This useful graph layout package was used many times, though Graphviz
was preferred over Pajek. URL: vlado.fmf.uni-lj.si/pub/networks/pajek



Chapter 3 

Small-World Nature of Protein
Contact Networks



3.1 Introduction 



There have been several efforts to study protein structures as networks
(graphs). In these studies the effort has been to analyse globular proteins as
systems composed of interacting parts. In recent years, with the elaboration of
network properties in a variety of real networks, Vendruscolo et al. showed
that protein structures have small-world topology. Greene and Higman studied
the short-range and long-range interaction networks in the structures of 65
proteins and showed that the long-range interaction network is not small-world
and that its degree distribution, while having an underlying scale-free
behaviour, is dominated by an exponential term indicative of a single-scale
system. Atilgan et al. studied globular protein structures and analysed the
network properties of the core and surface of the proteins. They established
that, regardless of size, the cores have the same local packing arrangements.
They also explained, with the example of the binding of two proteins, how the
small-world topology could be useful in the efficient and effective dissipation
of the energy generated upon binding.

3.2 Small-World topology of PCNs

The small-world nature of protein networks is a basic finding. It is reflected
in two properties: high clustering compared to random controls, and a
logarithmic increase in the characteristic path length with increase in the
size of the network.

Figure 3.1: Distribution of sizes of proteins analysed.

The function that a protein serves in the cell is decided by the structure of
the protein. Proteins, owing to oft-repeated structural constructs, can be
classified [65] (Structural Classification of Proteins,
http://scop.mrc-lmb.cam.ac.uk/scop/) based on their structural composition. We
analysed 80 proteins (listed in Tables 2.1 to 2.4), 20 each from the four major
categories (α, β, α/β, α + β) of the SCOP structural classification. These are
from diverse functional groups: hydrolase, transferase, protease, calcium
binding, oxidoreductase, antifungal, signalling, transport, toxin, and
coagulation factor, to name a few. The size of these proteins varied from 73 to
2359 amino acids. Fig. 3.1 shows the histogram of the sizes of these proteins
and their break-up across the structural classes.

We calculated the average clustering coefficient (C) and the characteristic
path length (L) of the proteins. Fig. 3.2(a) shows the L versus C plot. As seen
in the figure, on the scale of 0 to 1, the proteins have very high clustering
coefficients. Apart from the very high C, what is interesting is that these 80
proteins are almost indistinguishable by this parameter. Thus, while presenting
a generic property of proteins (that of high clustering), similar to that of a
large number of other complex networks, the small-world network result provides
a grim picture in terms of our ability to correlate this specific network
(geometric) parameter to the proteins' structure and function.

Figure 3.2: (a) L-C plot of proteins from four structural classes. (b) Increase
in the L of proteins with logarithmic increase in size ($n_r$). The dotted line
is a log-linear fit to the PCN data.

Figure 3.3: The small-world nature of the proteins. Inset: Standard deviations
of the corresponding data.

For a network to be classified as a small-world network, apart from high
clustering, its L should increase only as $\log(n_r)$. Such a logarithmic
scaling of L with $n_r$ makes it a small-world network, i.e. any node in the
network can be reached from any other node in very few steps. Fig. 3.2(b) shows
that the L of these 80 PCNs scales logarithmically with the size of the
network. These two properties thus establish the small-world nature of the PCNs
across structural classes. Fig. 3.2(b) also shows the L of the random controls
of the PCNs (marked with an arrow).
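
The characteristic path length used in this argument can be sketched with a
breadth-first search from every node. The snippet is illustrative only: the
helper names and the edge-list input are assumptions, and it presumes a
single-component, unweighted network such as a PCN with an intact backbone.

```python
from collections import deque


def build_neighbours(edges):
    """Adjacency sets for an undirected edge list."""
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    return neighbours


def bfs_distances(neighbours, source):
    """Shortest-path lengths (in links) from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in neighbours[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def characteristic_path_length(edges):
    """Mean shortest-path length L over all pairs of connected nodes."""
    neighbours = build_neighbours(edges)
    total, pairs = 0, 0
    for source in neighbours:
        dist = bfs_distances(neighbours, source)
        total += sum(dist.values())
        pairs += len(dist) - 1   # exclude the source itself
    return total / pairs


# A bare chain of 4 residues versus the same chain closed by one extra contact.
chain = [(0, 1), (1, 2), (2, 3)]
print(characteristic_path_length(chain))
print(characteristic_path_length(chain + [(0, 3)]))
```

Adding a single contact between the two ends of the chain shortens L, which is
the mechanism by which long-range contacts keep L small in a folded chain.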

Fig. 3.3 shows the summary plot of L-C for all 80 proteins, with their Type I
random controls in the bottom-left, regular controls at the extreme right, and
PCNs in the middle. The inset of the figure shows the means and standard
deviations of L and C of the corresponding data. As seen in the figure, the L of
PCNs are of the same order of magnitude as those of their Type I random
controls. PCNs of these proteins have very high clustering coefficients
compared to their random controls (statistically significant, p < 0.001;
two-sample Kolmogorov-Smirnov test). The L and C for random and regular
controls, here and in the rest of this chapter, were computed from the
analytical formulae given in Subsection 2.3.1 and Subsection 2.3.4,
respectively.

3.3 Globular and Fibrous Proteins 

Most proteins are "globular" in their three-dimensional structure, in which the
polypeptide chain folds into a compact shape. In contrast, "fibrous" proteins
have a relatively simple, elongated three-dimensional structure suited to their
biological function (see Fig. 3.4(b)). The "small-world" nature of globular
proteins was argued to be required for enhancing the ease of dissipation of
disturbances. If that were true, the fibrous proteins should depart from the
small-world nature. We studied fibrous proteins and compared their network
properties with those of globular proteins of comparable sizes. Table 3.1 shows
the details of these proteins. As shown in the L-C plot in Fig. 3.4(a), fibrous
proteins have larger L, although their C are similar to those of globular
proteins. Thus, in this respect, the fibrous proteins also show "small-world"
properties. The average diameter of the fibrous proteins (D = 15) was found to
be larger than that of the globular proteins (D = 8.57). This is expected
because of the elongated structure of fibrous proteins. Despite this major
difference in structure, the network properties of fibrous proteins and
globular proteins are not very different. This indicates that the "small-world"
property of proteins is generic and persists irrespective of structural
differences.



Sr.No. | PDB ID | $n_r$ | L | C
F1 | 1CGD | 90 | 5.401 | 0.7463
F2 | 1CAG | 88 | 5.274 | 0.6933
F3 | 1EI8 | 172 | 5.610 | 0.6045
F4 | 1QSU | 89 | 5.337 | 0.6432
G5 | 1AEA | 87 | 3.382 | 0.5942
G6 | 1AE2 | 86 | 4.066 | 0.5952
G6 | 1AYI | 86 | 3.812 | 0.6025
G7 | 1C6R | 88 | 3.740 | 0.6055
G8 | 1CEI | 85 | 3.713 | 0.6024
G9 | 1CTJ | 89 | 3.763 | 0.5968
G10 | 1DSL | 88 | 3.404 | 0.5509

Table 3.1: List of four fibrous (F1-F4) and seven globular proteins (G5-G10)
analysed.

[Figure 3.4 panels. (a) L versus C; legend: Fibrous Proteins (4), Globular
Proteins (7). (b) A Fibrous Protein (1cag); A Globular Protein (1a6m).]

Figure 3.4: (a) L-C plot of fibrous and globular proteins of comparable
sizes. (b) Examples of three-dimensional structures of a fibrous and a globular
protein (not to scale) with their PDB codes.

3.4 α and β Proteins

Figure 3.5: L-C plot for α and β proteins. Arrows indicate the means of C
for α and β proteins.

As seen earlier (Figure 3.2(a)), both the α and β classes of proteins show
small-world properties. Given that these are two distinct structural units, one
would want to know how that is reflected in the global network parameters of α
and β proteins. On finer analysis of the L-C properties, we find that, while
they are indistinguishable on the L-scale, there is a marginal yet consistent
difference in the clustering coefficients of α and β proteins, as shown in
Fig. 3.5. The means of C for the α and β proteins studied are 0.588 and 0.538,
respectively. According to the Kolmogorov-Smirnov test, this difference is
statistically significant (p < 0.001).



3.5 Degree Distributions of PCNs and LINs 

The distribution of the degrees is an important property that characterises
the network topology. The degree distribution of a random network is
characterised by a Poisson distribution. The degree distribution of many
real-world networks has been shown to be of the scale-free type [92]. Many
models have been proposed to explain the evolution of networks and the degree
distributions that characterise them at present.

We analysed the degree distributions of the 80 proteins mentioned above.
Figure 3.6 shows the scatter plot of the normalised degree distributions
($P(k)$) of all 80 proteins of the four classes (α, β, α + β, and α/β). Data
points in each plot indicate P(k) values for all the residues of the 20 proteins
of the respective class. The solid line is a Gaussian fit to the mean of P(k)
for each value of k.



As seen, the shapes of these distributions are single-humped and Gaussian-like.
Importantly, unlike in scale-free degree distributions, the number of nodes
with very high degree falls off rapidly. This is interesting because in
scale-free networks high-degree nodes (hubs) are known to facilitate
communication across the network by providing shorter routes through them.
Hence hubs would partially explain the small-world nature. But, clearly, the
distribution of contacts in proteins is dominated by an exponential term.

Figure 3.7 shows the 1-σ standard deviation of the data of the 20 PCNs for the
normalised degree distribution of the respective classes. The solid line is a
Gaussian fit to $\langle P(k) \rangle$, the mean of P(k) for each value of k.
The Gaussian fit was obtained with

    y(x) = \frac{A}{w \sqrt{\pi/2}} \exp\left( -\frac{2 (x - x_c)^2}{w^2} \right)

Figure 3.6: Scatter plot of degree distributions for (a) α, (b) β, (c) α + β,
and (d) α/β proteins, 20 of each class (x-axis: degree k). The solid line is the
Gaussian fit through the means.

Table 3.2 gives the parameter values and the goodness of fit. Here, $R^2$ is the
coefficient of determination.
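
To make the fit quantities concrete, the sketch below evaluates a Gaussian in
an area-normalised form with parameters A (area), w (width), and a centre, and
computes the coefficient of determination $R^2$ between observed and fitted
values. The functional form, the function names, and the numerical values are
assumptions made for illustration; they are not taken from the fitting
software used in this work.

```python
from math import exp, pi, sqrt


def gaussian(x, area, width, centre):
    """Area-normalised Gaussian: A / (w * sqrt(pi/2)) * exp(-2 (x - xc)^2 / w^2)."""
    return area / (width * sqrt(pi / 2.0)) * exp(-2.0 * (x - centre) ** 2 / width ** 2)


def r_squared(observed, predicted):
    """Coefficient of determination, R^2 = 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical parameter values, chosen only to exercise the functions.
area, width, centre = 1.0, 4.0, 8.0
degrees = list(range(1, 17))
fitted = [gaussian(k, area, width, centre) for k in degrees]

# Synthetic "observations": the fitted curve with a small alternating offset.
observed = [f + (0.002 if k % 2 else -0.002) for k, f in zip(degrees, fitted)]
print(r_squared(observed, fitted))
```

An $R^2$ close to 1 means that the fitted curve accounts for nearly all of the
variance in the data, which is how the values in the last column of the table
should be read.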



Class | $x_c$ | w | A | $R^2$
α | 7.922 | 3.524 | 3.443 | 0.9126
β | 8.175 | 5.429 | 6.555 | 0.9189
α + β | 7.506 | 4.216 | 5.535 | 0.9732
α/β | 7.961 | 5.192 | 6.146 | 0.9684

Table 3.2: Degree Distribution Curve Fitting. Parameters and goodness of
fit.

The degree distribution of LINs is shown in Fig. 3.8. As seen, the P(k) of LINs
shows a single-scale decay with no typical node present in them.

Figure 3.7: Degree distributions for (a) α, (b) β, (c) α + β, and (d) α/β
proteins, 20 of each class. The error bars show the 1-σ standard deviation
around the mean value.

3.6 Diameter of PCNs and LINs 

The concept of diameter, strictly speaking, is applicable only to
single-component graphs. Owing to the backbone connectivity, PCNs and their
other versions are always single-component. The diameter is expected to scale
with the number of nodes in the same way as the characteristic path length (L).
Fig. 3.9 shows that D does scale logarithmically with $n_r$. Since the diameter
is the maximum of the distances between any two nodes, the growth of D with
$n_r$ imposes an upper limit on the rate of growth of L with $n_r$.
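
The relation between D and L can be illustrated with a small sketch that
computes both from the same set of shortest-path lengths. The function names
and the edge-list input are assumptions made for this example, and the network
is taken to be single-component.

```python
from collections import deque


def all_pair_distances(edges):
    """Shortest-path lengths between all ordered pairs of distinct nodes."""
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)

    distances = []
    for source in neighbours:
        dist = {source: 0}
        queue = deque([source])
        while queue:                      # breadth-first search from `source`
            u = queue.popleft()
            for v in neighbours[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        distances.extend(d for node, d in dist.items() if node != source)
    return distances


def diameter_and_path_length(edges):
    """Return (D, L): the largest and the mean shortest-path length."""
    distances = all_pair_distances(edges)
    return max(distances), sum(distances) / len(distances)


# A chain of 6 nodes: D is the end-to-end distance, L is the pairwise mean.
chain = [(i, i + 1) for i in range(5)]
D, L = diameter_and_path_length(chain)
print(D, L)
```

Because D is the maximum and L the mean of the same set of distances, D can
never be smaller than L for any network.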

Figure 3.8: Scatter plot of degree distributions for LINs of (a) α, (b) β, (c)
α + β, and (d) α/β proteins, 20 of each class. Data points in each plot
indicate P(k) values for all the residues of the LINs of the 20 PCNs of the
respective class.

3.7 C-$n_r$ Plot

Clustering coefficient is essentially the probability of formation of triangles
in the network. In a random network, the probability that two first neighbours
of a given node are themselves connected is equal to the probability that any
two randomly selected nodes are connected. Therefore, the clustering
coefficient ($C_{rand}$) of a random graph is given by

    C_{rand} = p = \frac{\langle k \rangle}{n_r}.        (3.1)

Therefore, according to Eq. 3.1, when $C_{rand} / \langle k \rangle$ of random
networks is plotted as a function of $n_r$ on a log-log scale for varying sizes
of the network, the data fall on a straight line with slope -1. The random
controls of PCNs show such behaviour, as shown in Fig. 3.10 (the data marked
with an arrow).
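
The slope of -1 can be checked numerically with a short sketch. It assumes the
random-graph relation $C_{rand} = \langle k \rangle / n_r$ and fits a straight
line to $\log(C_{rand} / \langle k \rangle)$ against $\log(n_r)$ by least
squares. The function names, network sizes, and mean degree are illustrative
choices, not data from this study.

```python
from math import log


def random_graph_clustering(mean_degree, n_nodes):
    """Clustering coefficient of a random graph, C_rand = <k> / n_r."""
    return mean_degree / n_nodes


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [log(x) for x in xs]
    ly = [log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    den = sum((a - mean_x) ** 2 for a in lx)
    return num / den


# Illustrative sizes and mean degree (assumed values).
sizes = [50, 100, 200, 400, 800]
mean_degree = 9.0
ratios = [random_graph_clustering(mean_degree, n) / mean_degree for n in sizes]
slope = loglog_slope(sizes, ratios)
print(slope)
```

A network whose C stays roughly constant as $n_r$ grows would instead give a
slope close to zero in the same plot, which is the contrast drawn for PCNs.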

Figure 3.10 also shows the change in C with changing size of the PCNs. Here the
C of PCNs does not change with the size of the network ($n_r$), which indicates
that the PCNs, far from being random, show signs of a hierarchical structure
[32].

Figure 3.9: Diameter of the PCNs. (a) α, (b) β, (c) α + β, and (d) α/β
proteins, 20 of each class.

Figure 3.10: Change in C of PCNs with increasing $n_r$, indicating the
hierarchical nature of the protein structures. Random control data are marked
with an arrow.

3.8 Discussion 

Our results show that protein networks have the "small-world" property
regardless of their structural classification (α, β, α + β, and α/β) and
tertiary structures (globular and fibrous proteins). Small-world nature implies
that PCNs have a high degree of clustering (compared to their random
counterparts). Clustering, for protein structures, represents the
extent/density of packing in the network. Thus the higher-order compaction
observed in proteins is in agreement with what is expected from globular
polymer chains, in contrast to 'randomly folded control polymers'.

Small but definite differences exist between the α and β classes, and between
fibrous and globular proteins. The size independence of the clustering
coefficient in proteins indicates a departure from the random nature and an
inherent modular organisation in the protein networks.

It is interesting to note that, unlike other networks, PCNs, while being
small-world, are not characterised by a scale-free degree distribution. The
absence of hubs in PCNs is understandable, as there is a physical limit on the
number of amino acids that can occupy the space within a certain distance
around another amino acid. Such system-specific restrictions have been
identified by Amaral et al. [93] as responsible for the emergence of different
classes of networks with characteristic degree distributions. They observed
that preferential attachment to vertices in many real scale-free networks [16]
can be hindered by factors like ageing of the vertices (e.g. actors networks),
the cost of adding links to the vertices, or the limited capacity of a vertex
(e.g. airports network).



Chapter 4 

Assortative Mixing in Protein 
Networks 



4.1 Introduction 



In recent years, there has been considerable interest [16, 29] in the structure
and dynamics of networks, with application to systems of diverse origins such
as society (actors' network, collaboration networks, etc.), technology
(world-wide web, Internet, transportation infrastructure), and biology
(metabolic networks, gene regulatory networks, protein-protein interaction
networks, food webs). The aim of these studies has been to relate general
network parameters to the structure, function, and evolution of this wide
variety of systems.



Assortative Mixing 

While analysing and later modelling the evolution and structure of real-world
complex networks, many features have been taken into account: the path length,
clustering, degree distribution, and degree correlations. A lot of emphasis has
been given to degree distribution. The pattern of connectivity among the nodes
of varying degrees also affects the interaction dynamics of the network. Degree
correlations measure the strength and pattern of this connectivity. Degree
correlations were largely neglected until it was emphasised, as shown in
Table 4.1, that most real-world networks (except social networks) are
disassortative [28]. It is evident that real-world networks of diverse origin,
other than social networks, are characterised by disassortative mixing [28].



network | n | r
physics coauthorship | 52,909 | 0.363
biology coauthorship | 1,520,251 | 0.127
mathematics coauthorship | 253,339 | 0.120
film actor collaborations | 449,913 | 0.208
company directors | 7,673 | 0.276
Internet | 10,697 | -0.189
World-Wide Web | 269,504 | -0.065
protein interactions | 2,115 | -0.156
neural network | 307 | -0.163
food web | 92 | -0.276

Table 4.1: Size (n) and assortativity coefficient (r) of a number of real-world
networks. Data adopted from [28]. Except for social networks, which have
positive r, all other networks are disassortative.



Social networks, with their assortative nature, appear to be fundamentally
different from other networks, and this property has been claimed [26] to
originate from their unusually high clustering coefficients and community
structure. Recently, assortative mixing has been demonstrated in a brain
functional network [94], but no biological basis has been assigned to the
property. The disassortative degree mixing in most complex networks is an
unsolved riddle, and questions regarding the origin of this property, and
whether it is a universal property of complex networks, have been adjudged
"one of the ten leading questions for network research".

Biological networks are of special interest as they are the products of a long
evolutionary history. The protein contact network is unique among
intra-cellular networks (such as metabolic networks, gene regulatory networks,
protein-protein interaction networks) in its method of synthesis: a linear
chain of amino acids that folds into a stable three-dimensional structure
through short- and long-range contacts among the residues.

Proteins are characterised by the covalent backbone connectivity. Short- as
well as long-range contacts are made in the process of folding. It is known
that short-range contacts are responsible for well-defined secondary
structures such as α-helices and β-sheets. The structure into which the protein






PDB ID | Class | $n_r$ | $n_c$ | $k_{max}$ | $\langle k \rangle$ | $r_{PCN}$ | $r_{LIN}$
1HRC | α | 104 | 488 | 17 | 9.3846 | 0.1821 | 0.3531
1IMQ | α | 86 | 411 | 17 | 9.5581 | 0.2586 | 0.3675
1YCC | α | 108 | 505 | 17 | 9.3519 | 0.2449 | 0.3379
2ABD | α | 86 | 405 | 17 | 9.4186 | 0.2874 | 0.2736
2PDD | α | 43 | 175 | 14 | 8.1395 | 0.1436 | 0.2616
1AEY | β | 58 | 271 | 15 | 9.3448 | 0.1145 | [illegible]
1CSP | β | 67 | 308 | 16 | 9.194 | [illegible] | 0.384
1MJC | β | 69 | 315 | 16 | 9.1304 | 0.3097 | 0.4115
1NYF | β | 58 | [illegible] | [illegible] | [illegible] | [illegible] | [illegible]
1PKS | β | 76 | 385 | 17 | 10.1316 | 0.1872 | 0.3326
1SHF | β | 59 | 269 | 16 | 9.1186 | 0.1511 | 0.5789
1SHG | β | 57 | 265 | 16 | 9.2982 | 0.1503 | 0.4414
1SRL | β | 56 | 260 | 16 | 9.2857 | 0.2101 | 0.4433
1TEN | β | 89 | 415 | 17 | 9.3258 | 0.1645 | 0.5649
1TIT | β | 89 | 430 | 17 | 9.6629 | 0.2048 | 0.1212
1WIT | β | 93 | 489 | 17 | 10.5161 | 0.0884 | 0.4072
2AIT | β | 74 | 374 | 17 | 10.1081 | 0.1827 | 0.437
3MEF | β | 69 | 316 | 15 | 9.1594 | 0.3359 | 0.3133
1APS | αβ | 98 | [illegible] | [illegible] | [illegible] | [illegible] | [illegible]
1CIS | αβ | 66 | 304 | 17 | 9.2121 | [illegible] | 0.4893
1COA | αβ | 64 | 274 | 17 | 8.5625 | [illegible] | 0.3439
1FKB | αβ | 107 | 539 | 15 | 10.0748 | 0.1704 | 0.4269
1HDN | αβ | 85 | 428 | 16 | 10.0706 | 0.1678 | 0.4305
1PBA | αβ | 81 | 345 | 14 | 8.5185 | 0.3228 | 0.2856
1UBQ | αβ | 76 | 326 | 13 | 8.5789 | 0.1782 | 0.2977
1URN | αβ | 96 | 444 | 18 | 9.25 | 0.3568 | 0.1949
1VIK | αβ | 99 | 430 | 15 | 8.6869 | 0.5191 | 0.2061
2HQI | αβ | 72 | 407 | 18 | 11.3056 | 0.145 | 0.1623
2PTL | αβ | 78 | 334 | 14 | 8.5641 | 0.5179 | 0.3125
2VIK | αβ | 126 | 616 | 19 | 9.7778 | 0.4144 | 0.464

Table 4.2: Data table for 30 single-domain, two-state folding proteins of α,
β, and αβ class.

chain folds, and many of its properties, hinge upon the long-range contacts
that are made on various 'scales', as specified by the separation distance
between the contacting residues.

Thus, PCNs are a special class of network systems. The small-world property of
proteins, as studied in Chapter 3, is a reflection of the compact nature of the
protein molecules. Beyond that, we investigated various network features of
protein contact networks at different length scales (viz. PCN and LIN) in an
attempt to get a better understanding of their structure, function, and
stability. In this chapter we analyse assortative mixing in PCNs and LINs.

Figure 4.1: Normalised degree distributions P(k) of (a) PCNs and (b) LINs
(x-axis: degree k). Shown in the insets are (a) Type I random controls of PCNs
and (b) their LINs. Thick lines are the best-fit curves for the means of the
data. Error bars indicate the standard deviation of the data for P(k) of nodes
with degree k across the 30 proteins analysed.

4.2 Data 

In the earlier study (Chapter 3) we had considered a set of 80 proteins, 20
proteins each from the four SCOP structural classes. Here, we considered a
separate set of 30 proteins. These 30 proteins (Tables 2.5 and 2.6) were
single-domain, two-state folding proteins. Note: henceforth, unless otherwise
mentioned, we use these 30 proteins for our analyses, supplementing them with
results from other data when required. Table 4.2 gives the following details:
number of nodes, $n_r$; number of contacts, $n_c$; maximum degree, $k_{max}$;
the average degree, $\langle k \rangle$, of the PCNs; and the network
parameters studied in this chapter, namely the coefficients of assortativity of
the PCNs and their LINs, $r_{PCN}$ and $r_{LIN}$.

4.3 Degree Distributions

As done with the other proteins, we first studied the normalised degree
distributions of the PCNs and LINs of these 30 proteins. As seen in
Fig. 4.1(a), the PCNs have a Gaussian degree distribution. The best fit was
obtained with a Gaussian expression with parameters A = 5.538, w = 6.265, and
centre 9.373.

On the other hand, Fig. 4.1(b) shows that the degree distribution of LINs is
very different from that of PCNs. In LINs, most nodes are populated in the
low-degree region and very few have high degrees. The best fit for the LINs is
a single-scale exponential function [3],

    P(k) \sim k^{-\gamma} \exp(-k / k_c),

with $\gamma$ = 0.24 and $k_c$ = 4.4.
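
As an illustration of this single-scale form, the sketch below evaluates
$k^{-\gamma} \exp(-k / k_c)$ with the quoted values of $\gamma$ and $k_c$ and
normalises it over a finite range of degrees. The function names and the degree
range 1 to 20 are assumptions made for the example; the proportionality
constant of the original fit is not known.

```python
from math import exp


def single_scale(k, gamma=0.24, k_c=4.4):
    """Unnormalised single-scale form k^(-gamma) * exp(-k / k_c)."""
    return k ** (-gamma) * exp(-k / k_c)


def normalised_distribution(degrees, gamma=0.24, k_c=4.4):
    """Normalise the single-scale form so that it sums to 1 over `degrees`."""
    weights = [single_scale(k, gamma, k_c) for k in degrees]
    total = sum(weights)
    return [w / total for w in weights]


# Degree range assumed for illustration only.
degrees = list(range(1, 21))
p = normalised_distribution(degrees)
for k, value in zip(degrees[:5], p[:5]):
    print(k, round(value, 4))
```

With these parameters the distribution decreases monotonically with k: most of
the weight lies at low degrees and the exponential cut-off at $k_c$ suppresses
high-degree nodes, in contrast to the single-humped shape found for PCNs.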

The nodes of degree 1 in the degree distributions of LINs are the N- and
C-terminal amino acids at either end of the protein backbone.

As expected [91], the Type I random controls of the PCNs (Fig. 4.1(a), inset)
have a Poisson degree distribution. The LINs of the Type I random controls
(Fig. 4.1(b), inset) too have a Poisson degree distribution. The figure clearly
shows that these properties are the same for all the proteins.



4.4 Assortative nature of PCNs and LINs 

We studied these 30 single-domain, two-state folding proteins for the
existence of degree-degree correlations in PCNs and LINs. We first studied the
$\langle k_{nn}(k) \rangle$ versus k profiles of these proteins. As mentioned
earlier, a trend in the degree correlation profile is a signature of the
corresponding degree mixing in the network.

Fig. 4.2 shows ⟨k_nn(k)⟩ versus k plots for the PCNs (□ in Fig. 4.2(a)) and
LINs (□ in Fig. 4.2(b)). The nature of these curves shows that both PCNs
and LINs are characterised by 'assortative mixing', as the average degree

Figure 4.2: (a) Comparison of the degree correlation patterns of PCNs and
their Type I random controls. (b) Comparison of the degree correlation patterns
of LINs and the LINs of Type I random controls. Horizontal axes: k (degree).

Figure 4.3: Histograms of the 'coefficient of assortativity' (r) of (a) PCNs
and (b) LINs (■) and their Type I random controls (□). Horizontal axis:
r (coefficient of assortativity).

of the neighbouring nodes increased with k. In comparison, ⟨k_nn(k)⟩
remained almost constant for the Type I random controls of PCNs (○ in
Fig. 4.2(a)) and their LINs (○ in Fig. 4.2(b)), indicating a lack of
correlations among the nodes' connectivity in these controls. Fig. 4.2(a) and (b)
clearly bring out the assortative nature of PCNs as well as their LINs.

The normalised degree correlation function, r, is zero for no correlations
among nodes' connectivity, and positive or negative for assortative or
disassortative mixing, respectively. We computed r_PCN and r_LIN of the proteins
(Table 4.2). The r values for both PCNs and LINs of the 30 proteins were found
to be positive, indicating that the networks are assortative. Fig. 4.3 shows
the histograms of r of (a) PCNs, (b) LINs, and their Type I random controls.
The r values of both PCNs and LINs of all the proteins are
significantly positive (range: 0.09 < r < 0.52 for PCNs, and
0.12 < r < 0.58 for LINs). Thus these naturally occurring biological networks
are clearly characterised by a high degree of assortative mixing. The
Type I random controls in Fig. 4.3(a, b), for both PCNs and their LINs,
are distributed around zero, confirming the lack of degree
correlations in the controls observed in Fig. 4.2.
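
The coefficient r can be illustrated as follows. This is a minimal sketch, assuming r is the Pearson correlation between the degrees found at the two ends of each link (the usual definition of the assortativity coefficient); the thesis does not show its implementation, so the function name and the edge-list input format are assumptions.

```python
import math


def assortativity_coefficient(edges):
    """Pearson correlation of end-point degrees over all links of an undirected simple graph."""
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    # Count each undirected link in both directions so the measure is symmetric.
    x, y = [], []
    for a, b in edges:
        x.extend([degree[a], degree[b]])
        y.extend([degree[b], degree[a]])
    n = len(x)
    mean = sum(x) / n  # x and y share the same mean and variance by symmetry
    cov = sum((xi - mean) * (yi - mean) for xi, yi in zip(x, y)) / n
    var = sum((xi - mean) ** 2 for xi in x) / n
    if var == 0:
        return math.nan  # undefined when all nodes have the same degree
    return cov / var


# Toy examples (not thesis data): a 4-node path and a 3-leaf star.
path_edges = [(0, 1), (1, 2), (2, 3)]
star_edges = [(0, 1), (0, 2), (0, 3)]
print(assortativity_coefficient(path_edges), assortativity_coefficient(star_edges))
```

Both toy graphs are disassortative (negative r); the PCNs and LINs reported above fall on the positive side instead.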

These properties of positive r and assortative degree correlations were also
observed (see Figure 4.4) for a large number of protein structures, belonging
to diverse structural categories, that were used in the studies of the earlier
chapter.



Figure 4.4: Degree correlation pattern of PCNs, showing their assortative
mixing. The circles (○) represent ⟨k_nn(k)⟩ for a given value of k across all
proteins; the filled squares are averages of these values, showing the trend of
degree correlations. Panel labels: α, β. Horizontal axes: k.
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4.5 Degree Distribution partially accounts for 
assortativity 

To investigate the possible role of different network features we built
appropriate controls, as discussed in Subsec. 2.1.3. Specifically, we investigated
whether the distribution of degrees has any effect on the observed
assortativity in PCNs and LINs. We studied the 'coefficient of assortativity' of Type
II random controls of the PCNs, in which, apart from having the same number
of nodes and contacts, we also preserved the degree distribution while
randomising the pair-connectivities. In Fig. 4.5 and Fig. 4.6 we show that
the assortativity is partially recovered in the Type II random controls for
both PCNs and their LINs. Thus the degree distribution partially explains the
observed assortative mixing. That is, preserving the degree distribution
of the PCN, even while randomising the pair-connectivities, is enough
to partially restore the assortative mixing in the random controls
of PCNs as well as their LINs. The recovery of assortative mixing in the
LINs of Type II random controls of PCNs is even more surprising, as the
degree distribution of LINs (Fig. 4.1(b)) is very different from that of the
PCNs (Fig. 4.1(a)). This is especially significant in the light of the
observation [95, 96] that one can rewire the links in a (scale-free) network to obtain
assortativity or disassortativity, to any degree, without any change in the
degree distribution.
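
One common way to randomise pair-connectivities while keeping every node's degree fixed is the double edge swap, sketched below. This is an assumption about how a Type II control could be generated, not the thesis's own procedure (its controls are defined in Subsec. 2.1.3 and may, for instance, treat backbone links specially). All names are hypothetical.

```python
import random


def degree_preserving_rewire(edges, n_swaps, seed=0):
    """Randomise an undirected simple graph by double edge swaps; node degrees are unchanged."""
    rng = random.Random(seed)
    edge_list = [tuple(sorted(e)) for e in edges]
    edge_set = set(edge_list)
    done, attempts = 0, 0
    while done < n_swaps and attempts < 100 * n_swaps and len(edge_list) >= 2:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c  # pick one of the two possible re-pairings at random
        if a == d or c == b:
            continue  # the swap would create a self-loop
        new1, new2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
        if new1 in edge_set or new2 in edge_set:
            continue  # the swap would create a duplicate link
        edge_set -= {edge_list[i], edge_list[j]}
        edge_set |= {new1, new2}
        edge_list[i], edge_list[j] = new1, new2
        done += 1
    return edge_list


def degree_sequence(edges, n_nodes):
    """Degree of every node 0..n_nodes-1 for the given edge list."""
    deg = [0] * n_nodes
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return deg


# Toy example: an 8-node ring with three chords.
original = [(i, (i + 1) % 8) for i in range(8)] + [(0, 4), (1, 5), (2, 6)]
rewired = degree_preserving_rewire(original, n_swaps=20, seed=1)
print(degree_sequence(original, 8) == degree_sequence(rewired, 8))
```

Because each swap replaces links (a, b) and (c, d) by (a, d) and (c, b), every node keeps its degree, so any assortativity that survives in such a control can be attributed to the degree distribution alone.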




Figure 4.5: (a) Recovery of the degree correlation pattern by Type II random
controls of PCNs. (b) Recovery of the degree correlation pattern by LINs of
Type II random controls. Horizontal axes: k (degree).

Figure 4.6: Histograms of the 'coefficient of assortativity' (r) of (a) PCNs
and (b) LINs (■) and their Type II random controls (□).

4.6 Discussion 
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Our coarse-grained complex network model of protein structures uncovers, 
for the first time in a naturally evolved biological system, the interesting
and exceptional topological feature of assortativity. The assortative nature
is found to be a generic feature of protein structures.

Our discovery of assortativity in the amino acid networks in protein
structures questions the invoked generality of the disassortativity property in
natural networks. By constructing appropriate random controls, we show
(Figures 4.5(a, b) and 4.6(a, b)) that the degree distribution can partially
explain the observed assortative nature of PCNs as well as their LINs. Thus,
this novel feature could be a reflection of the mechanisms of contact
formation in proteins while folding that have evolved through natural selection.
An obvious question would be, "What are the processes by which a typical
protein acquires a Gaussian-like degree distribution?"

A large number of networks are shown [16] to have scale-free degree
distributions. The scale-free distribution, characterised by a power law,
$P(k) \sim k^{-\gamma}$, with a scaling exponent $\gamma$, is explained with the
help of a growing network model with 'preferential attachment of the nodes,'
which are being added to the network [97]. In addition to others' 0, l82|, our
results on the degree distributions (Fig. 4.1) also show that the process
underlying the formation of the PCNs does not follow the 'preferential
attachment' mode. This is understandable, as the PCNs differ from other networks
in many aspects. PCNs are characterised by covalent backbone connectivity, which
constrains the connectivity pattern. As opposed to other networks, PCNs evolve by
changing the connectivity pattern through noncovalent contacts, while keeping
the number of nodes constant [10]. Also, in PCNs, steric hindrance limits
the number of contacts an amino acid can have. All this could lead to the
observed degree distribution in PCNs.



From computational studies, it has been observed [28, 95] that assortative
networks percolate easily, i.e., information gets transferred through
the network more easily than in disassortative networks. Protein folding
is a cooperative phenomenon, and hence communication amongst nodes is
essential so that appropriate noncovalent interactions can take place to form
the stable native-state structure [98]. Thus percolation of information is
essential and could lead to the observed cooperativity and fast folding
of the proteins. Hence the assortative mixing observed in proteins could be an
essential prerequisite for facilitating the folding of proteins.

Disassortative mixing is observed in certain networks of biological origin such 
as metabolic signalling pathways network, and gene regulatory network jool . 
This disassortativity is conjectured to be responsible for decreasing the like- 
lihood of crosstalk between different functional modules of the cell, and in- 
creasing the overall robustness of a network by localising effects of deleterious 
perturbations. In contrast to these two networks, for the PCN one may put 
forward the possibility of the backbone chain connectivity as a means of 
conferring greater robustness against perturbations. It would also be
interesting to study the role of "community structure" in conferring assortativity
in these molecular networks [26, 10G].



Here we have shown that the assortative mixing in PCNs and LINs is a
generic feature of protein structures. Also, the r values observed are quite high
compared to other real-world networks (see Tables 4.1 and 4.2). It may be
pointed out that this is the first instance of the presence of assortative mixing
in a naturally occurring biological network, as all other networks studied [28]
(except for social networks) have been shown to exhibit disassortative mixing.
The role, if any, that the assortative nature of the protein contact networks may
play in the kinetics of the folding process is discussed in the next chapter.



Chapter 5 

Correlation of Topological 
Parameters to the Rate of 
Folding of Two-State, 
Single-Domain Proteins 



5.1 Introduction 



In the previous chapter (Chapter 4), we showed that proteins, in general,
are characterised by assortative mixing. Given that, in general, networks
are known to be characterised by disassortative mixing [28], this brings forth
an exceptional feature of proteins. It also calls for an explanation as to the
purpose, if any, served by this special property. In this chapter we seek
an answer to this question.

In Chapter 3 we also showed that clustering coefficients (C), which
enumerate the local compactness of PCNs, cannot be used to distinguish proteins from
each other (see the L-C plot in Fig. 3.2). In reality, proteins are unique and
function-specific, and have biophysical properties which distinguish them
despite structural similarities. Clearly, the observed trends in characteristic
path length (L) and clustering coefficient (C) are generic features that
capture their compact nature and ease of communication within the structures.
This indicates that either the complex network studies are limited by the
coarse-grained approach in drawing conclusions about the specific
functionalities of proteins, or those of their residues, or we need to look at different
parameter(s) that capture relevant features even at this level of coarse-graining.



There is evidence [101] to suggest that coarse-graining is indeed a useful way
of simplifying protein structure data while not losing the relevant biophysical
details. Keeping this in view, in this chapter, we proceed to investigate both
the PCNs and their long-range counterparts (LINs).



Rate of folding and Native-state topology 

Though proteins comprise thousands of atoms, and hence potentially millions
of inter-atomic interactions are possible, folding rates and
mechanisms appear to be largely determined by the topology of the native
(folded) state [61]. For network analyses and biophysical comparison we
choose proteins that are structurally and kinetically simple. Hence we use
single-domain, two-state folding proteins for our studies. Many geometrical
parameters, viz. Contact Order (CO) [102], Long-range Order (LRO) [103], and
Total Contact Distance (TCD) [104], that have been defined based on the
native-state structure of the protein have been shown to be negatively
correlated with the rate of folding (ln(k_F)). Contact Order, as well as LRO
and TCD, which are its variants, essentially measure the average sequence
separation between residues that make contacts in the 3-D structure. The
correlation is remarkable given that it holds over a million-fold range of
folding rates and for diverse structures. The observed negative correlation has
been explained in terms of the increased time needed to span the conformational
space with increase in the value of these parameters. It is a reasonable
explanation given that all these parameters enumerate the average normalised
separation between residues that are in 'spatial contact' in the protein's
native-state structure.
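
As an illustration of "average normalised sequence separation", the relative contact order can be sketched as below. This assumes the commonly used form, the mean sequence separation of contacting pairs divided by chain length, applied to residue-level contacts; the exact definitions in [102-104] differ in detail (for example in how contacts are identified), and the function name is hypothetical.

```python
def relative_contact_order(contacts, n_residues):
    """Mean sequence separation |i - j| over all contacts, divided by the chain length."""
    contacts = list(contacts)
    if not contacts:
        raise ValueError("at least one contact is required")
    total_separation = sum(abs(i - j) for i, j in contacts)
    return total_separation / (n_residues * len(contacts))


# Toy example (not thesis data): a 5-residue chain with two contacts.
toy_contacts = [(0, 4), (1, 3)]
print(relative_contact_order(toy_contacts, n_residues=5))
```

A structure dominated by contacts between residues far apart in sequence gives a larger value, which is the regime associated above with slower folding.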

This evidence was one of the two reasons we conjectured that our
topology-based parameters may have a bearing on the rate of folding. The other was that
our novel property of assortative mixing is independent of short-range contacts,
as shown in Chapter 4. Zhou and Zhou [104] reported that "the accuracy of
total contact distance in predicting folding rates is essentially unchanged if
'short'-ranged contacts (|i - j| < 14) are not included in the calculations".
Given their observation, we proceeded to check whether the assortativity
coefficient (r) could have a bearing on the rate of folding.
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5.2 Data 

The analysis was conducted with 30 single-domain, two-state folding,
globular proteins listed in Tables 2.5 and 2.6. Table 5.1 provides the data
for Figure 5.1. Table 5.2 provides the data for Figures 5.2, 5.3, 5.4, and
5.5. Table 5.1 lists the PDB IDs of the 30 single-domain, two-state
folding proteins that have been used in this study and the L and C of their
corresponding PCNs and LINs. Table 5.2 lists the coefficients of assortativity
and clustering coefficients of PCNs and LINs as well as the rate of folding of
these proteins.

5.3 Clustering Coefficients of PCNs and LINs 

As in Chapter 3, here too we studied the L-C properties of the 30
proteins under consideration. To examine whether the PCNs of the 30 proteins and
their corresponding LINs have similar topological properties, such as
characteristic path length (L) and clustering coefficient (C), we plotted the data
of L and C from Table 5.1 in Fig. 5.1. The plot also shows their corresponding
Type I random controls. The Type II random controls were found to be
indistinguishable from the Type I controls and are not shown in Fig. 5.1.

The results indicate two major differences between the topological properties 
of the PCNs and their corresponding LINs. The PCNs of these proteins have 
high clustering coefficients (C > 0.55) compared to their random controls, 
whereas the LINs show distribution in C over a range (0.16 to 0.45) even 
though their random counterparts were almost indistinguishable from those 
of PCNs. L and C of random controls of PCNs were 2.168±0.11 & 0.1224± 
0.0284 and that of their LINs were 2.395±0.0699 & 0.0942±0.0178. The LINs 
also have marginally higher characteristic path lengths (4.379 ± 0.7677) than 
PCNs (3 ± 0.371) owing to their reduced number of contacts as compared to 
those in PCNs. 
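
The two quantities compared above can be computed as in the following sketch. It assumes a connected, unweighted network stored as a dictionary mapping each node to the set of its neighbours, takes L as the mean shortest-path length over all node pairs, and takes C as the average of the local clustering coefficients (with nodes of degree below 2 contributing zero). These conventions and the function names are assumptions, not the thesis's stated implementation.

```python
from collections import deque


def characteristic_path_length(neighbours):
    """Mean shortest-path length over all reachable ordered node pairs (BFS from each node)."""
    total, pairs = 0, 0
    for source in neighbours:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in neighbours[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


def clustering_coefficient(neighbours):
    """Average over nodes of (links among a node's neighbours) / (possible such links)."""
    local = []
    for node, nbrs in neighbours.items():
        k = len(nbrs)
        if k < 2:
            local.append(0.0)
            continue
        nbr_list = list(nbrs)
        links = sum(
            1
            for i in range(k)
            for j in range(i + 1, k)
            if nbr_list[j] in neighbours[nbr_list[i]]
        )
        local.append(2.0 * links / (k * (k - 1)))
    return sum(local) / len(local)


# Toy examples: a triangle (fully clustered) and a 3-node path (no triads).
triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
path3 = {0: {1}, 1: {0, 2}, 2: {1}}
print(characteristic_path_length(triangle), clustering_coefficient(triangle))
print(characteristic_path_length(path3), clustering_coefficient(path3))
```

Removing links, as happens when going from a PCN to its LIN, can only lengthen shortest paths, which is consistent with the higher L reported for LINs.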

Notice that these differences in C_LIN, compared to C_PCN, assign
specificity to the network models of proteins, which is otherwise missing
in C_PCN.
PDB ID  Class  L_PCN   C_PCN   L_LIN   C_LIN
1HRC    α      3.427   0.5816  5.5174  0.2713
1IMQ    α      3.0711  0.6027  5.2774  0.1902
1YCC    α      3.4574  0.5693  5.4283  0.2766
2ABD    α      3.22    0.5882  4.5447  0.2289
2PDD    α      2.4352  0.6171  4.9767  0.1612
1AEY    β      2.5529  0.6147  3.6122  0.3503
1CSP    β      2.7291  0.5954  3.8471  0.3271
1MJC    β      2.7715  0.5933  4.0273  0.3333
1NYF    β      2.5983  0.5987  3.9734  0.3457
1PKS    β      2.7421  0.5855  3.9537  0.3646
1SHF    β      2.6236  0.6059  4.2303  0.3628
1SHG    β      2.5175  0.5949  3.6028  0.3824
1SRL    β      2.5156  0.5887  3.6338  0.382
1TEN    β      3.2824  0.5738  4.3011  0.4309
1TIT    β      3.0904  0.552   4.2515  0.4519
1WIT    β      3.1346  0.5753  3.9589  0.4488
2AIT    β      2.8297  0.5922  3.726   0.4396
3MEF    β      2.7626  0.5952  3.9757  0.3136
1APS    αβ     3.1273  0.5676  3.9405  0.4158
1CIS    αβ     2.8154  0.5892  3.8424  0.349
1COA    αβ     2.8358  0.5768  3.9559  0.3901
1FKB    αβ     3.319   0.5821  4.5636  0.3626
1HDN    αβ     2.8538  0.5534  3.779   0.3243
1PBA    αβ     3.2256  0.5859  4.7802  0.2801
1UBQ    αβ     3.0996  0.6074  4.9846  0.3243
1URN    αβ     3.2529  0.5864  4.5465  0.2971
1VIK    αβ     3.6199  0.5849  5.2247  0.3138
2HQI    αβ     2.5767  0.5928  3.3521  0.3935
2PTL    αβ     3.972   0.599   6.982   0.2541
2VIK    αβ     3.4179  0.565   4.575   0.3095

Table 5.1: The PDB IDs and the L and C values of the PCNs and LINs of 30
single-domain, two-state folding proteins of the α, β, and αβ classes.
PDB ID  Class  r_PCN   C_PCN   r_LIN   C_LIN   ln(k_F)  Ref.
1HRC    α      0.1821  0.5816  0.3531  0.2713   8.76    [105]
1IMQ    α      0.2586  0.6027  0.3675  0.1902   7.31    [106]
1YCC    α      0.2449  0.5693  0.3379  0.2766   9.62    [?]
2ABD    α      0.2874  0.5882  0.2736  0.2289   6.55    [?]
2PDD    α      0.1436  0.6171  0.2616  0.1612   9.8     [109]
1AEY    β      0.1145  0.6147  0.3133  0.3503   2.09    [110, 111]
1CSP    β      0.2929  0.5954  0.4822  0.3271   6.98    [70]
1MJC    β      0.3027  0.5933  0.4823  0.3333   5.24    [?]
1NYF    β      0.1752  0.5987  0.3432  0.3457   4.54    [?]
1PKS    β      0.1872  0.5855  0.4269  0.3646  -1.05    [?]
1SHF    β      0.1511  0.6059  0.4305  0.3628   4.55    [?]
1SHG    β      0.1503  0.5949  0.2856  0.3824   1.41    [?]
1SRL    β      0.2101  0.5887  0.2977  0.382    4.04    [115]
1TEN    β      0.1645  0.5738  0.1949  0.4309   1.06    [?]
1TIT    β      0.2048  0.552   0.2061  0.4519   3.47    [?]
1WIT    β      0.0884  0.5753  0.1623  0.4488   0.41    [?]
2AIT    β      0.1827  0.5922  0.3125  0.4396   4.2     [?]
3MEF    β      0.3359  0.5952  0.464   0.3136   5.3     [103]
1APS    αβ     0.193   0.5676  0.2829  0.4158  -1.48    [120]
1CIS    αβ     0.2935  0.5892  0.384   0.349    3.87    [?]
1COA    αβ     0.2805  0.5768  0.4115  0.3901   3.87    [?]
1FKB    αβ     0.1704  0.5821  0.4006  0.3626   1.46    [122]
1HDN    αβ     0.1678  0.5534  0.3326  0.3243   2.7     [123]
1PBA    αβ     0.3228  0.5859  0.5789  0.2801   6.8     [124]
1UBQ    αβ     0.1782  0.6074  0.4414  0.3243   7.33    [?]
1URN    αβ     0.3568  0.5864  0.4433  0.2971   5.76    [126]
1VIK    αβ     0.5191  0.5849  0.5649  0.3138   6.8     [?]
2HQI    αβ     0.145   0.5928  0.1212  0.3935   0.18    [?]
2PTL    αβ     0.5179  0.599   0.4072  0.2541   4.1     [?]
2VIK    αβ     0.4144  0.565   0.437   0.3095   6.8

Table 5.2: The r and C of PCNs and LINs, and the corresponding rate of folding
ln(k_F), for 30 single-domain, two-state folding proteins of the α, β, and αβ
classes.
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Figure 5.1: L-C plot for 30 single domain, two-state proteins: PCNs (□), 
LINs (O), Type I Random Controls of PCNs (•) and LINs(A). Error-bars 
in the random controls data indicate standard deviations in L and C for each 
protein computed over 100 instances. 

5.4 Correlation of protein network parameters to protein folding rates

We have shown that even though the PCNs and their LINs differ in their
clustering coefficients (C) (Fig. 5.1), both show a high coefficient of
assortativity (r) (Table 5.2), with r_PCN being marginally lower (0.2412 ± 0.1082)
than r_LIN (0.36 ± 0.1102). We now study the correlation of the
network parameters to the rate of folding ln(k_F) of the corresponding proteins
(Table 5.2).

5.4.1 Coefficient of Assortativity and Rate of Folding 

Figure 5.2 shows the plot of r_PCN with ln(k_F). As seen in the figure, though
there is a positive trend in the data, the correlation is poor. We find
the correlation coefficient for PCNs to be 0.3776 (p < 0.04). The
correlation improves (0.5943; p < 0.005) when the five α proteins are
excluded.
Figure 5.2: Plot of the rate of folding (ln(k_F)) versus the assortativity
coefficient of PCNs (r_PCN) of 30 proteins.

Figure 5.3: Plot of rate of folding, ln(k_F), and the Coefficient of Assortativity
of LINs (r_LIN) of the PCNs of the 30 proteins. The trend line is shown as a
dashed line.

Figure 5.4: Plot of the rate of folding, ln(k_F), with Clustering Coefficient of
PCNs (C_PCN).

In Fig. 5.3 the rates of folding of the 30 proteins are plotted as a function
of the coefficient of assortativity of their LINs. There is an increasing trend
of ln(k_F) with increase in r (correlation coefficient = 0.4221; p < 0.02016).
Though the β and αβ proteins show an increasing trend, the five α proteins have
high ln(k_F) values. The correlation coefficient between the rate of folding
(ln(k_F)) and r of the LINs, excluding the five α proteins, is 0.6981
(p < 0.0005). This implies that, along with showing assortative mixing, the PCNs
and particularly their LINs show significant positive correlations with the
rate of folding. Thus the generic property of assortative mixing in proteins
tends to contribute positively towards their kinetics of folding and is fairly
independent of the short and long range of interactions.

5.4.2 Average Clustering Coefficient and Rate of Folding

Figure 5.4 shows the plot of ln(k_F) with the clustering coefficient of the PCNs
(C_PCN) of the 30 proteins. As is obvious from the plot, there is hardly any
correlation between the two parameters. This is borne out by the correlation
coefficient, which we compute as -0.2437 (p < 0.2).
Figure 5.5: Plot of rate of folding, ln(k_F), with Clustering Coefficient of
LINs (C_LIN). The trend line is indicated by a dashed line.

Figure 5.6: The plot of ln(k_F) with C_LIN × k_max.

Figure 5.5 shows the plot of ln(k_F) with the clustering coefficient of the LINs
(C_LIN). The ln(k_F) values show a high negative correlation (corr. coeff. = -0.7337;
p < 0.0001) with C_LIN for all the proteins.
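
The correlation quoted above can be re-derived from the C_LIN and ln(k_F) columns of Table 5.2. This is a minimal sketch, assuming a plain Pearson correlation coefficient and assuming the table values are transcribed correctly here; the function name is hypothetical.

```python
import math

# (PDB ID, C_LIN, ln(k_F)) as listed in Table 5.2.
TABLE_5_2 = [
    ("1HRC", 0.2713, 8.76), ("1IMQ", 0.1902, 7.31), ("1YCC", 0.2766, 9.62),
    ("2ABD", 0.2289, 6.55), ("2PDD", 0.1612, 9.8), ("1AEY", 0.3503, 2.09),
    ("1CSP", 0.3271, 6.98), ("1MJC", 0.3333, 5.24), ("1NYF", 0.3457, 4.54),
    ("1PKS", 0.3646, -1.05), ("1SHF", 0.3628, 4.55), ("1SHG", 0.3824, 1.41),
    ("1SRL", 0.382, 4.04), ("1TEN", 0.4309, 1.06), ("1TIT", 0.4519, 3.47),
    ("1WIT", 0.4488, 0.41), ("2AIT", 0.4396, 4.2), ("3MEF", 0.3136, 5.3),
    ("1APS", 0.4158, -1.48), ("1CIS", 0.349, 3.87), ("1COA", 0.3901, 3.87),
    ("1FKB", 0.3626, 1.46), ("1HDN", 0.3243, 2.7), ("1PBA", 0.2801, 6.8),
    ("1UBQ", 0.3243, 7.33), ("1URN", 0.2971, 5.76), ("1VIK", 0.3138, 6.8),
    ("2HQI", 0.3935, 0.18), ("2PTL", 0.2541, 4.1), ("2VIK", 0.3095, 6.8),
]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


c_lin = [row[1] for row in TABLE_5_2]
ln_kf = [row[2] for row in TABLE_5_2]
corr = pearson(c_lin, ln_kf)
print(round(corr, 4))
```

The same function applied to the r_PCN or r_LIN columns would give the assortativity correlations of Section 5.4.1; the p-values quoted in the text require a separate significance test that is not sketched here.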

C_LIN enumerates the number of triads made among the nodes of the Long-range
Interaction Network. Thus C_LIN essentially correlates with the number of
'distant' amino acids (nodes), separated by 12 or more other amino
acids along the backbone, brought into mutual 'contact' with each other in the
native-state structure of the protein. Understandably, the more such
long-range mutual contacts have to be made in order to achieve the
native state, the longer the protein takes to fold, and hence the slower the rate of
folding.
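
The construction of a LIN from a PCN can be sketched as a simple filter on sequence separation. This sketch assumes that backbone links (|i - j| = 1) are retained, which is suggested by the terminal residues having degree 1 in LINs, and that a contact counts as long-range when |i - j| is at least a threshold. The exact inequality and the default of 12 are assumptions, so the threshold is left as a parameter; the function name is hypothetical.

```python
def long_range_interaction_network(contacts, min_separation=12):
    """Keep backbone links and contacts whose sequence separation is >= min_separation."""
    lin = set()
    for i, j in contacts:
        separation = abs(i - j)
        if separation == 1 or separation >= min_separation:
            lin.add((min(i, j), max(i, j)))
    return lin


# Toy PCN contacts given as residue-index pairs (not thesis data).
pcn_contacts = [(0, 1), (1, 2), (0, 3), (0, 12), (2, 20), (5, 16)]
print(sorted(long_range_interaction_network(pcn_contacts)))
```

Short-range contacts such as (0, 3) are dropped, so triads that survive in the LIN necessarily involve residues far apart along the chain.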

The calculation of the clustering coefficient is dependent on the degree of the
node. In Figure 5.6 we scale C_LIN with k_max as C_LIN × k_max and plot it
against ln(k_F). We find that, after scaling, the correlation coefficient improves
to -0.7712 (p < 0.0001). Thus C_LIN shows a significantly negative
correlation with the rate of folding of single-domain, two-state folding proteins,
a property completely neutralised in PCNs.



5.5 Discussion 

Here we have studied two important network parameters (C and r) of 30
single-domain, two-state folding proteins at two length scales: PCNs and
LINs. The results show that even though the PCNs show the "small-world"
property in their L-C plot, the C_LIN values are comparatively low and are
distributed over a range between the random controls and PCNs. We have
studied the correlation of two widely used topological parameters
(assortativity and clustering) to the kinetics of folding at different length scales.

A 'positive' correlation of ln(k_F) with r (coefficient of assortativity) is an
important feature of this network, as the property is maintained for both
PCNs and LINs. Thus, it is a generic feature of proteins, which need fast
network transmission of information for functional versatility in the cell. Apart
from helping in fast folding, assortative mixing, with its role in percolation
of information, could also be important for allostery and signalling in
proteins that require transfer of binding information among different parts of
the protein for further function. It may be noted that a 'positive' correlation
of ln(k_F) with r is an exceptional feature of the coefficient of assortativity, as
all other measures described so far [102, 103, 104] have been known to have
a negative correlation. Given the genetic basis and mode of formation of
protein chains, the signature of assortativity as an indicator of the rate of
folding is clear. It will be interesting to see which physico-chemical factors
could be responsible for the positive correlation with r, thus speeding up the
rate of folding with increasing assortative mixing of the proteins.

Most networks having high degree of clustering consist of nodes such that any 
two neighbours of these nodes have a high probability of themselves being 
linked to each other. The PCNs have been shown to have high degree of 
clustering, which contributes to their small-world nature helping in efficient 
and effective dissipation of energy needed in their function [sl, Q]. Though the 
LINs have significantly lower clustering coefficients than their PCNs (Fig. 5.1),
they show (Fig. 5.5) a negative correlation with the rate of folding of the
proteins. This indicates that clustering of the amino acids that participate in
the long-range interactions into 'cliques' slows down the folding process.
However, the clustering coefficient of PCNs does not have any significant
correlation to the rate of folding, indicating that the short-range interactions
may be playing a constructive and active role in the determination of the rate
of the folding process by reducing the negative contribution of the LINs. Our
results show that the separation of the types of contacts into PCNs
and LINs delineates the length scale of contacts that play a crucial role
in protein folding.



Chapter 6 



Conclusions 



After synthesis in the cell, folding of the amino acid chain is important
for attaining the structure required to reach a functional state as soon as
possible. This happens through the formation of short- as well as long-range
interactions. While the former are largely responsible for the formation of
secondary structure units, the latter bring spatially distant (along the chain)
residues closer. Secondary and tertiary structures are formed primarily by
noncovalent interactions. Our graph theoretical representations of protein
structure, the Protein Contact Network (PCN) and the Long-range Interaction
Network (LIN), model various aspects of the three-dimensional structure of a
protein in an attempt to understand its function and kinetics.

The Small World Nature 

We found that proteins of diverse structural and functional classification have
a small-world nature, with low characteristic path length (L) and a high level of
clustering (C), as shown in Figure 6.1. In this regard PCNs are similar to
most other real-world networks. Interestingly, we find that LINs depart from
the small-world nature. The LINs have a medium range of C in the proteins
studied. The small-world nature of PCNs has been implicated in the
ease of dissipation of energy upon complexation. Such a property may have
an important role in efficient allosteric regulation of protein functions.

Figure 6.1: PCN and Random & Regular controls of 2PDD. Comparison with the
Watts-Strogatz model. Panels: Regular Control, PCN, Random Control, arranged
along an axis of increasing randomness from p = 0 to p = 1.

Hierarchy, Modularity, and Community structure 



We find that PCNs are characterised by a hierarchical nature, as shown by
the independence of their clustering coefficient from size (Figure 3.10). This
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observation is in accord with earlier findings by other researchers "[4 
and more investigation needs to be done along this line. We would like to point
out that it has been found in many real networks [32] that hierarchy and
modular architecture go hand in hand. Our preliminary results suggest a modular
architecture in PCNs. It would be interesting to see what significance, if
any, the modules thus found in protein structures would have. Such modules
could be identified by network community structure algorithms.



Assortative Nature of PCNs and LINs 



In contrast to all other naturally evolved intracellular networks studied so
far, we found that contact networks of proteins show assortative mixing at
both short and long length scales, i.e. rich nodes tend to connect to other
rich nodes. This is an exceptional property, as all other real-world networks
known (except for social networks) are disassortative. Interestingly, we find
that LINs too are assortative, which implies that assortativity is independent
of short-range interactions. We built appropriate random controls to identify
the network feature that possibly contributes towards assortativity.
We found that the degree distribution contributes significantly towards
assortative mixing in PCNs as well as their LINs.

The predominance of disassortativity in real-world networks has been
suggested to confer the property of robustness (reduced spread of perturbations)
on the network. Then why are the contact networks of protein structures
assortative? Communication among the residues of the protein is important.
It is known that a "network of residues" mediates allosteric communication in
proteins Tj, l73|] . It is also proposed that allostery is an intrinsic property of 



all dynamic proteins [129]. We propose that assortativity is an indicator of
an 'allosteric communication network' established within the protein structure
and is important enough to be found in all proteins.

The role of specific residues in protein folding and their evolutionary
conservation is highly debated [130, 131]. Mirny et al. [6J] found that conserved
residues that are part of the folding nucleus, across proteins, were in
contact with each other. Based on this finding, we propose that the folding
nucleus of a protein could be a subset of the set of residues that form an
assortative group.

Our observation of assortative mixing, and hence of a set of residues that are
part of an assortative network, opens up new directions of work.



Biophysical implication of topological parameters 

One would expect the exceptional network properties that we observe to have
biophysical implications. We found that for both PCNs and LINs, the
coefficient of assortativity has a positive correlation
with the rate of folding of single-domain, two-state folding proteins.
Similarly, we find that the clustering coefficient of LINs has a high negative
correlation to the rate of folding of these proteins, though that of PCNs shows
no significant correlation. Other workers have developed parameters specific
to proteins (CO, LRO, TCD) and correlated them with the rate of folding. Our aim
was to show the relevance of general network parameters to a kinetic
property of the proteins. Indices such as closeness and betweenness offer more local
and hence residue-specific information. By combining our general, global
parameters with such local ones, one could address broader questions related to
protein structure and function.

Advantages and Limitations of PCN Model 

PCN is, by virtue of coarse-graining, a simple model. It doesn't involve
the time evolution of the protein structure; rather, it models the static
native-state structure. We don't explicitly consider information about the
chemical nature of the side-chains in this model. However, since the final
native-state structure is an outcome of the chemical interactions
among the various amino acids, the model implicitly accounts for the chemical
interactions involved.

Each of the twenty amino acids has a different number of atoms and hence a
different size. The nature of noncovalent contacts therefore depends on the
specific amino acids involved. We have not included that information in our
model so far.

The time evolution of the protein structure can be considered by building
a weighted network using the Transition State Ensemble (TSE) structures.
Depending on the question being asked and its sensitivity to the
above-mentioned details, one may consider adding further details to the PCN,
thereby enhancing it.

Thus complex network analysis offers an important tool for studying the
structure-function relationship of proteins, the fascinating molecules of life.



Appendix A 



Pseudocode of the Algorithms Implemented

In this Appendix we list pseudocode for some of the important algorithms
we implemented for the complex network analyses of protein structures.
Algorithm A.0.1 lists the various experiments to be performed on a single
protein structure. Following is a list of frequently used variables across
all the algorithms.

n_r   Number of nodes in the network
n_c   Number of links in the network
Adj   The n_r x n_r adjacency matrix



Algorithm A.0.1: ComplexNetworksAnalysis(PDBFile)

comment: The 'main' function for Complex Network analyses.

Analyse-PCN(PDBFile)
Analyse-PCN-LIN(Adj)
Analyse-RandomControlTypeI(n_r, n_c)
Analyse-RandomControlTypeI-LIN(Adj)
Analyse-RandomControlTypeII(Adj)
Analyse-RandomControlTypeII-LIN(Adj)



Algorithm A.0.2: AnalysePCN(PDBFile)



comment: Function for analyses of Protein Contact Network.

GetCoordinates(PDBFile)
    return (XYZ, n_r)
GetAdjacencyMatrix(XYZ, n_r, R_c)
    return (Distances, Adj, n_c)
GetDegree(Adj, n_r)
    return (Degree, k_max)
GetDegreeDistribution(Degree, n_r)
    return (DegreeDist)
GetDegreeCorrelations(Degree, n_r)
    return (DegreeCorrelations)



Algorithm A.0.3: GetAdjacencyMatrix(XYZ, n_r, R_c)

comment: Function for construction of adjacency matrix.

procedure ComputeDist(XYZ, i, j)
    Dist ← sqrt( SUM over k = 1..3 of (XYZ(i, k) - XYZ(j, k))^2 )
    return (Dist)

Adj ← 0
Distance ← 0

for i ← 1 to n_r - 1
    do for j ← i + 1 to n_r
        do Dist ← ComputeDist(XYZ, i, j)
           if (Dist <= R_c)
               then Adj(i, j) ← 1
                    Adj(j, i) ← 1
                    Distance(i, j) ← Dist
                    Distance(j, i) ← Dist

return (Adj, Distance)
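
As an illustration of Algorithm A.0.3, the following Python sketch builds the contact matrix with NumPy broadcasting instead of the explicit double loop. It is a minimal sketch, not the implementation used in this work: the function name get_adjacency_matrix, the toy coordinates and the cutoff value of 7.0 are assumptions chosen for the example.

```python
import numpy as np

def get_adjacency_matrix(xyz, r_c):
    """Return (adj, distance) for an (n_r, 3) coordinate array and cutoff r_c."""
    xyz = np.asarray(xyz, dtype=float)
    # Pairwise Euclidean distances via broadcasting: result has shape (n_r, n_r).
    diff = xyz[:, None, :] - xyz[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Two distinct nodes are in contact when they lie within the cutoff.
    adj = (dist <= r_c).astype(int)
    np.fill_diagonal(adj, 0)
    # Store distances only for contacting pairs, as in the pseudocode.
    distance = np.where(adj == 1, dist, 0.0)
    return adj, distance

# Hypothetical toy coordinates: four points on a line, 3 units apart.
toy_xyz = [[0, 0, 0], [3, 0, 0], [6, 0, 0], [9, 0, 0]]
adj, distance = get_adjacency_matrix(toy_xyz, r_c=7.0)
n_c = int(adj.sum()) // 2  # each undirected link is stored twice
```

Here n_c is obtained by halving the sum of the symmetric matrix, since every link appears as both (i, j) and (j, i).
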
Algorithm A.0.4: GetDegree(Adj, n_r)

Degree(n_r, 2) ← 0
for i ← 1 to n_r
    do Degree(i, 1) ← i

for i ← 1 to n_r
    do for j ← 1 to n_r
        do Degree(i, 2) ← Degree(i, 2) + Adj(i, j)

DegreeMax ← GetDegreeMaximum(Degree)
return (Degree, DegreeMax)



Algorithm A.0.5: GetDegreeDistribution(Degree, n_r)

DegreeDistribution(1 : n_r, 1 : 2) ← 0
DegreeDistributionNorm(1 : n_r, 1 : 2) ← 0
for i ← 1 to n_r
    do DegreeDistribution(i, 1) ← i
       DegreeDistributionNorm(i, 1) ← i

for i ← 1 to n_r
    do DegreeDistribution(Degree(i, 2), 2) ←
           DegreeDistribution(Degree(i, 2), 2) + 1

MaxDegreeDist ← GetMax(DegreeDistribution(:, 2))
DegreeDistributionNorm(:, 2) ←
    DegreeDistribution(:, 2) / MaxDegreeDist

return (DegreeDistribution, DegreeDistributionNorm)
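
Algorithms A.0.4 and A.0.5 can be sketched compactly in Python. This is an illustrative sketch only: the function names and the small toy network are assumptions, and the normalisation by the largest count follows the pseudocode above rather than the more common normalisation by the number of nodes.

```python
import numpy as np

def get_degree(adj):
    """Return the degree of every node and the maximum degree."""
    adj = np.asarray(adj, dtype=int)
    degree = adj.sum(axis=1)  # row sums of a 0/1 adjacency matrix
    return degree, int(degree.max())

def get_degree_distribution(degree):
    """Return counts[k] = number of nodes of degree k, and counts / max(counts)."""
    counts = np.bincount(np.asarray(degree, dtype=int))
    norm = counts / counts.max()
    return counts, norm

# Hypothetical toy network with links 0-1, 1-2, 2-3 and 0-2.
toy_adj = np.zeros((4, 4), dtype=int)
for a, b in [(0, 1), (1, 2), (2, 3), (0, 2)]:
    toy_adj[a, b] = toy_adj[b, a] = 1

degree, k_max = get_degree(toy_adj)
counts, norm = get_degree_distribution(degree)
```
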
Algorithm A.0.6: GetDegreeCorrAvg(Degree, Adj, n_r)

DegreeCorr(1 : n_r, 1 : 2) ← 0
DegreeCorrAvg(1 : n_r, 1 : 2) ← 0
for i ← 1 to n_r
    do DegreeCorr(i, 1) ← Degree(i, 2)
       for j ← 1 to n_r
           do if (Adj(i, j) == 1)
               then DegreeCorr(i, 2) ← DegreeCorr(i, 2) + Degree(j, 2)

for i ← 1 to n_r
    do DegreeCorrAvg(i, 1) ← i
       DegreeCorrAvg(DegreeCorr(i, 1), 2) ←
           DegreeCorrAvg(DegreeCorr(i, 1), 2) + DegreeCorr(i, 2)

for i ← 1 to n_r
    do if (DegreeDist(i, 2) != 0)
        then DegreeCorrAvg(i, 2) ← DegreeCorrAvg(i, 2) / DegreeDist(i, 2)

return (DegreeCorr, DegreeCorrAvg)
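
The degree correlation of Algorithm A.0.6 can be illustrated with a short Python sketch. Note one assumption that departs from the listing above: the sketch divides each node's neighbour-degree sum by the node's own degree, so that the result is the mean neighbour degree for each degree class, whereas the pseudocode accumulates the undivided sums. The function name and the toy network are also assumptions.

```python
import numpy as np

def get_degree_corr_avg(adj):
    """Return {k: mean degree of the neighbours of nodes having degree k}."""
    adj = np.asarray(adj, dtype=int)
    degree = adj.sum(axis=1)
    # For every node, the sum of the degrees of its neighbours.
    neighbour_degree_sum = adj @ degree
    k_nn = {}
    for k in np.unique(degree):
        if k == 0:
            continue  # isolated nodes have no neighbours
        nodes = degree == k
        # Per-node mean neighbour degree, averaged over all nodes of degree k.
        k_nn[int(k)] = float((neighbour_degree_sum[nodes] / k).mean())
    return k_nn

# Hypothetical toy network with links 0-1, 1-2, 2-3 and 0-2.
toy_adj = np.zeros((4, 4), dtype=int)
for a, b in [(0, 1), (1, 2), (2, 3), (0, 2)]:
    toy_adj[a, b] = toy_adj[b, a] = 1

k_nn = get_degree_corr_avg(toy_adj)
```

A curve of k_nn that rises with k indicates assortative mixing, while a falling curve indicates disassortative mixing.
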
Algorithm A.0.7: GetCoeffAssortativity(Degree, Adj, n_r)



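Term01 ← 0
Term02 ← 0
Term03 ← 0
for i ← 1 to n_r
    do for j ← 1 to n_r
        do if (Adj(i, j) == 1)
            then Term01 ← Term01 + (Degree(i, 2) * Degree(j, 2))
                 Term02 ← Term02 + (Degree(i, 2) + Degree(j, 2))
                 Term03 ← Term03 + (Degree(i, 2)^2 + Degree(j, 2)^2)

TotalEdges = SUM(Adj)
Temp01 ← Term01 / TotalEdges
Temp02 ← (Term02 / (2 * TotalEdges))^2
Temp03 ← Term03 / (2 * TotalEdges)

CoeffAssortativity ← (Temp01 - Temp02) / (Temp03 - Temp02)

return (CoeffAssortativity)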
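
The coefficient of assortativity of Algorithm A.0.7 can be sketched in Python as follows. The sketch assumes the standard Pearson-correlation form of the coefficient for an undirected network, with sums taken over both orientations of every link; the function name, the handling of the degenerate case and the two toy networks are assumptions made for illustration.

```python
import numpy as np

def get_coeff_assortativity(adj):
    """Degree assortativity of an undirected 0/1 adjacency matrix."""
    adj = np.asarray(adj, dtype=int)
    degree = adj.sum(axis=1)
    # Every link appears twice here, once as (i, j) and once as (j, i).
    i_idx, j_idx = np.nonzero(adj)
    k_i = degree[i_idx].astype(float)
    k_j = degree[j_idx].astype(float)
    total = len(i_idx)  # equals SUM(Adj), i.e. twice the number of links
    temp01 = (k_i * k_j).sum() / total
    temp02 = ((k_i + k_j).sum() / (2 * total)) ** 2
    temp03 = (k_i ** 2 + k_j ** 2).sum() / (2 * total)
    if temp03 == temp02:
        return float("nan")  # undefined when all link ends have equal degree
    return (temp01 - temp02) / (temp03 - temp02)

def toy_network(n, links):
    """Hypothetical helper: build a symmetric adjacency matrix from a link list."""
    adj = np.zeros((n, n), dtype=int)
    for a, b in links:
        adj[a, b] = adj[b, a] = 1
    return adj

r_star = get_coeff_assortativity(toy_network(4, [(0, 1), (0, 2), (0, 3)]))
r_path = get_coeff_assortativity(toy_network(4, [(0, 1), (1, 2), (2, 3)]))
```

A star network is perfectly disassortative, which makes it a convenient check of the sign convention.
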
Algorithm A.0.8: GetClusteringCoeff(Degree, Adj, n_r)



ClusteringCoeff(1 : n_r, 1 : 2) ← 0
AvgCC ← 0
for i ← 1 to n_r
    do ClusteringCoeff(i, 1) ← i

for i ← 1 to n_r
    do for j ← 1 to n_r
        do for k ← 1 to n_r
            do if (Degree(i, 2) >= 2)
                then ClusteringCoeff(i, 2) ← ClusteringCoeff(i, 2) +
                         (Adj(i, j) * Adj(i, k) * Adj(j, k))
       if (Degree(i, 2) >= 2)
           then ClusteringCoeff(i, 2) ← ClusteringCoeff(i, 2) /
                    (Degree(i, 2) * (Degree(i, 2) - 1))

for i ← 1 to n_r
    do AvgCC = AvgCC + ClusteringCoeff(i, 2)
AvgCC = AvgCC / n_r

return (ClusteringCoeff, AvgCC)
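
The clustering coefficient of Algorithm A.0.8 can be illustrated with the following Python sketch. It replaces the nested loops with a matrix product, using the fact that the diagonal of the cubed adjacency matrix counts every triangle through a node twice. The function name and toy network are assumptions, and nodes of degree below 2 are assigned a coefficient of zero, as in the pseudocode.

```python
import numpy as np

def get_clustering_coeff(adj):
    """Return per-node clustering coefficients and their average."""
    adj = np.asarray(adj, dtype=float)
    degree = adj.sum(axis=1)
    # (A^3)_ii = number of ordered neighbour pairs (j, k) of i that are linked.
    linked_pairs = np.diagonal(adj @ adj @ adj)
    cc = np.zeros(len(adj))
    mask = degree >= 2
    # Normalise by the number of ordered neighbour pairs, k_i * (k_i - 1).
    cc[mask] = linked_pairs[mask] / (degree[mask] * (degree[mask] - 1))
    return cc, float(cc.mean())

# Hypothetical toy network with links 0-1, 1-2, 2-3 and 0-2 (one triangle).
toy_adj = np.zeros((4, 4), dtype=int)
for a, b in [(0, 1), (1, 2), (2, 3), (0, 2)]:
    toy_adj[a, b] = toy_adj[b, a] = 1

cc, avg_cc = get_clustering_coeff(toy_adj)
```
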
Algorithm A.0.9: GetClusteringCoeffDist(ClusteringCoeff)



ClusteringCoeffDist(1 : 10, 1 : 2) ← 0
for i ← 1 to 10
    do ClusteringCoeffDist(i, 1) = i

for i ← 1 to n_r
    do if ClusteringCoeff(i, 2) >= 0 & ClusteringCoeff(i, 2) < 0.1
           ClusteringCoeffDist(1, 2) = ClusteringCoeffDist(1, 2) + 1
       else if ClusteringCoeff(i, 2) >= 0.1 & ClusteringCoeff(i, 2) < 0.2
           ClusteringCoeffDist(2, 2) = ClusteringCoeffDist(2, 2) + 1
       else if ClusteringCoeff(i, 2) >= 0.2 & ClusteringCoeff(i, 2) < 0.3
           ClusteringCoeffDist(3, 2) = ClusteringCoeffDist(3, 2) + 1
       else if ClusteringCoeff(i, 2) >= 0.3 & ClusteringCoeff(i, 2) < 0.4
           ClusteringCoeffDist(4, 2) = ClusteringCoeffDist(4, 2) + 1
       else if ClusteringCoeff(i, 2) >= 0.4 & ClusteringCoeff(i, 2) < 0.5
           ClusteringCoeffDist(5, 2) = ClusteringCoeffDist(5, 2) + 1
       else if ClusteringCoeff(i, 2) >= 0.5 & ClusteringCoeff(i, 2) < 0.6
           ClusteringCoeffDist(6, 2) = ClusteringCoeffDist(6, 2) + 1
       else if ClusteringCoeff(i, 2) >= 0.6 & ClusteringCoeff(i, 2) < 0.7
           ClusteringCoeffDist(7, 2) = ClusteringCoeffDist(7, 2) + 1
       else if ClusteringCoeff(i, 2) >= 0.7 & ClusteringCoeff(i, 2) < 0.8
           ClusteringCoeffDist(8, 2) = ClusteringCoeffDist(8, 2) + 1
       else if ClusteringCoeff(i, 2) >= 0.8 & ClusteringCoeff(i, 2) < 0.9
           ClusteringCoeffDist(9, 2) = ClusteringCoeffDist(9, 2) + 1
       else if ClusteringCoeff(i, 2) >= 0.9 & ClusteringCoeff(i, 2) < 1.0
           ClusteringCoeffDist(10, 2) = ClusteringCoeffDist(10, 2) + 1






Algorithm A.0.10: GetTypeIRandomControl(n_r, n_c)

adjTypeI(1 : n_r, 1 : n_r) ← 0
RandomEdges = n_c - (n_r - 1)
for i ← 1 to n_r - 1
    do adjTypeI(i, i + 1) ← 1
       adjTypeI(i + 1, i) ← 1

RandomEdgesCounter = 0
while RandomEdgesCounter < RandomEdges
    do iRan = RandomNumber(iSeed)
       jRan = RandomNumber(jSeed)
       i ← iRan * (n_r - 1) + 1
       j ← jRan * (n_r - 1) + 1
       if (i != j && |i - j| != 1 && adjTypeI(i, j) != 1)
           then adjTypeI(i, j) ← 1
                adjTypeI(j, i) ← 1
                RandomEdgesCounter ← RandomEdgesCounter + 1

return (adjTypeI)
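
A Python sketch of the Type I random control of Algorithm A.0.10 is given below. It keeps the chain of backbone links between consecutive nodes and places the remaining links uniformly at random. The function name, the seed argument, the feasibility checks and the example sizes are assumptions added for illustration.

```python
import numpy as np

def get_type_i_random_control(n_r, n_c, seed=0):
    """Random network with n_r nodes and n_c links that keeps the backbone chain."""
    max_links = n_r * (n_r - 1) // 2
    if not (n_r - 1 <= n_c <= max_links):
        raise ValueError("n_c must lie between n_r - 1 and n_r * (n_r - 1) / 2")
    rng = np.random.default_rng(seed)
    adj = np.zeros((n_r, n_r), dtype=int)
    # Backbone: link every node to its successor along the chain.
    idx = np.arange(n_r - 1)
    adj[idx, idx + 1] = 1
    adj[idx + 1, idx] = 1
    remaining = n_c - (n_r - 1)
    while remaining > 0:
        i, j = (int(v) for v in rng.integers(0, n_r, size=2))
        # Accept only distinct, non-consecutive and not yet linked pairs.
        if abs(i - j) > 1 and adj[i, j] == 0:
            adj[i, j] = adj[j, i] = 1
            remaining -= 1
    return adj

# Hypothetical sizes chosen only for the example.
control = get_type_i_random_control(n_r=10, n_c=20)
```

Counting a link only when a new pair is accepted guarantees that the control has exactly n_c links.
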



Algorithm A.0.11: GetLIN(adj, n_r)

for i ← 1 to n_r - 1
    do for j ← i + 1 to n_r
        do if (adj(i, j) == 1 && j != i + 1 && j <= i + LRIThreshold)
            then adj(i, j) ← 0
                 adj(j, i) ← 0

return (adj)
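
The construction of a LIN from a contact network, as in Algorithm A.0.11, can be sketched in Python as follows. The sketch removes every contact whose sequence separation is greater than 1 but within the threshold, so that only backbone links and long-range contacts remain. The function name and the threshold value used in the example are assumptions, as is the choice of an inclusive comparison with the threshold.

```python
import numpy as np

def get_lin(adj, lri_threshold):
    """Return a copy of adj without short-range, non-backbone contacts."""
    lin = np.array(adj, dtype=int, copy=True)
    n_r = len(lin)
    i, j = np.indices((n_r, n_r))
    separation = np.abs(i - j)
    # Short-range contacts: not backbone neighbours, but within the threshold.
    short_range = (separation > 1) & (separation <= lri_threshold)
    lin[short_range] = 0
    return lin

# Hypothetical example: a fully connected network of six nodes.
full_adj = np.ones((6, 6), dtype=int) - np.eye(6, dtype=int)
lin = get_lin(full_adj, lri_threshold=3)
```
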
Algorithm A.0.12: GetTypeIIRandomControl(adj, n_r, n_c, EdgeRewirings)



EdgeRewiringsCounter = 0

while EdgeRewiringsCounter < EdgeRewirings
    do iRan = RandomNumber(iSeed)
       jRan = RandomNumber(jSeed)
       pRan = RandomNumber(pSeed)
       qRan = RandomNumber(qSeed)
       i ← iRan * (n_r - 1) + 1
       j ← jRan * (n_r - 1) + 1
       p ← pRan * (n_r - 1) + 1
       q ← qRan * (n_r - 1) + 1

       while adj(i, j) == 0 || adj(p, q) == 0 || adj(i, q) == 1 ||
             adj(p, j) == 1 || |i - j| < 2 || |p - q| < 2 || |i - q| < 2 ||
             |p - j| < 2
           do iRan = RandomNumber(iSeed)
              jRan = RandomNumber(jSeed)
              pRan = RandomNumber(pSeed)
              qRan = RandomNumber(qSeed)
              i ← iRan * (n_r - 1) + 1
              j ← jRan * (n_r - 1) + 1
              p ← pRan * (n_r - 1) + 1
              q ← qRan * (n_r - 1) + 1

       adj(i, j) ← 0; adj(j, i) ← 0
       adj(p, q) ← 0; adj(q, p) ← 0
       adj(i, q) ← 1; adj(q, i) ← 1
       adj(p, j) ← 1; adj(j, p) ← 1

       EdgeRewiringsCounter ← EdgeRewiringsCounter + 1

return (adj)
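
The rewiring of Algorithm A.0.12 can be sketched in Python as below. Each accepted step swaps the end points of two existing non-backbone links, (i, j) and (p, q), into (i, q) and (p, j), which leaves the degree of every node unchanged. The function name, the seed and max_tries arguments and the toy network are assumptions; max_tries is a guard added so that the sketch cannot loop forever on a network that admits no valid swap.

```python
import numpy as np

def get_type_ii_random_control(adj, edge_rewirings, seed=0, max_tries=100000):
    """Degree-preserving rewiring of non-backbone links."""
    adj = np.array(adj, dtype=int, copy=True)
    n_r = len(adj)
    rng = np.random.default_rng(seed)
    done, tries = 0, 0
    while done < edge_rewirings:
        tries += 1
        if tries > max_tries:
            raise RuntimeError("no valid rewiring found within max_tries")
        i, j, p, q = (int(v) for v in rng.integers(0, n_r, size=4))
        # Reject unless (i, j) and (p, q) are existing non-backbone links
        # and (i, q) and (p, j) are free, non-backbone positions.
        if (adj[i, j] == 0 or adj[p, q] == 0
                or adj[i, q] == 1 or adj[p, j] == 1
                or abs(i - j) < 2 or abs(p - q) < 2
                or abs(i - q) < 2 or abs(p - j) < 2):
            continue
        adj[i, j] = adj[j, i] = 0
        adj[p, q] = adj[q, p] = 0
        adj[i, q] = adj[q, i] = 1
        adj[p, j] = adj[j, p] = 1
        done += 1
    return adj

# Hypothetical toy network: a 12-node chain plus five long-range links.
toy_adj = np.zeros((12, 12), dtype=int)
links = [(k, k + 1) for k in range(11)] + [(0, 5), (2, 8), (3, 10), (6, 11), (1, 7)]
for a, b in links:
    toy_adj[a, b] = toy_adj[b, a] = 1

rewired = get_type_ii_random_control(toy_adj, edge_rewirings=3)
```

Because the backbone links are never selected and never created, the chain connectivity of the original network is retained in the control.
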